Press "Enter" to skip to content

Curated SQL Posts

A Primer on Stan

Jack Kennedy explains the concepts of Stan and JAGS:

You may have used a probabilistic programming language (PPL) in the past, such as BUGS, to perform Bayesian inference. You’ve heard about Stan and want to learn a little more. Or maybe you’re about to step into the Bayesian paradigm and don’t know where to start. You want to know whether you should make the switch from JAGS to Stan, or you’ve used neither JAGS nor Stan and want to know which will suit you best. This post will focus solely on the differences between JAGS and Stan, as I have experience with both of them, but there are many more PPLs out there. For example, I have never used Bean Machine, but of all the PPLs, it certainly takes the crown for best name.

Stan has been on my to-learn list for a while and I did successfully get one of my employees (a rassa-frassin’ frequentist) to use and enjoy the power of Bayesian analysis. One of these days, I’ll have to get back to it.
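
For a sense of what this looks like in practice, here is a minimal sketch of my own (not from Jack's post) that fits a tiny normal model from Python via cmdstanpy; it assumes cmdstanpy and CmdStan are already installed, and the data values are made up.

```python
# Minimal, hypothetical example: estimate a normal mean and standard deviation
# with Stan, driven from Python through cmdstanpy.
from pathlib import Path
from cmdstanpy import CmdStanModel

stan_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);      // weakly informative priors
  sigma ~ normal(0, 5);
  y ~ normal(mu, sigma);   // likelihood
}
"""
Path("normal.stan").write_text(stan_code)

model = CmdStanModel(stan_file="normal.stan")   # compiles the model
fit = model.sample(data={"N": 5, "y": [1.2, 0.7, 1.9, 0.3, 1.1]})
print(fit.summary())                            # posterior summaries for mu and sigma
```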


So You Have a Programming Question

Louis Davidson takes us through the process of asking a good question:

Over the past 25 years, I have answered a lot of programming questions in online forums, from co-workers, and from friends. It has been a while since I have been around forums, but I recently decided it was time to get back to what started me in the SQL community: answering questions. Not only is it complementary to my current job as Simple-Talk editor, it is really a great joy to be able to help other people with their problems. It is also educational to see the kinds of things other people are going through before you also go through them.

Sometimes you know the answer right away, but sometimes the question is a challenge (these are actually my favorites). At times I have spent hours on a question, trying to figure out a solution that works, and then writing up the answer to post back. Along the way, I am learning something new and (assuming I figured out a correct answer) the question poster usually learns something too. I mean, I have been wrong a few times, at least.

It’s been a decade since I’ve haunted Q&A forums, but yeah, answering questions is a good way to learn more. Still, it can be really frustrating for everybody when you get (or ask) a vague question: there’s a lot of opportunity for confusion, non-viable answers, or even people ignoring the question completely because they can’t make sense of it.


Disabling Public Network Access in Synapse

Ryan Adams builds a private endpoint:

If you disable public access to your Azure Synapse workspace, you will get the following error message when attempting to open Synapse Studio.

“Failed to load one or more resources due to forbidden issue, error code 403.” 

Click through for more information about routing for Synapse resources and what you’d need to do in order to disable public network access entirely.
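
If you want to script that change rather than click through the portal, here is a hedged sketch using the azure-mgmt-synapse Python SDK; the resource names and subscription ID are placeholders, and the exact model and parameter names are my assumption rather than anything from Ryan's post.

```python
# Hypothetical sketch: turn off public network access on a Synapse workspace.
# Assumes azure-identity and azure-mgmt-synapse are installed and you have
# rights on the workspace; names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import WorkspacePatchInfo

client = SynapseManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# After this, Synapse Studio is reachable only over a private endpoint;
# hitting the public endpoint is what produces the 403 error quoted above.
poller = client.workspaces.begin_update(
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
    workspace_patch_info=WorkspacePatchInfo(public_network_access="Disabled"),
)
print(poller.result().public_network_access)
```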


Power BI Incremental Refresh with Non-Standard Dates in Parquet Files

Shabnam Watson hits on a specific but interesting use case:

The most common scenario for setting up the out-of-the-box incremental refresh in Power BI is to base it off of a datetime column; however, there are cases when you may want to set up incremental refresh based off of a column with a data type other than datetime. Examples are when you are working with a smart date ID column (01012023 for Jan 1, 2023) or when you are working with a source system that has partitioned data using a column such as Year that has a numeric data type.

A use case for the latter scenario is when you are working with Parquet/Delta files via Azure Synapse Analytics Serverless SQL Pool. When working with larger datasets, it is typical to see the Parquet/Delta files partitioned by date ranges. Depending on how much data there is, the partitioning may be at the Year level instead of Day.

With that scenario in mind, read on to learn how you can minimize your Power BI processing time and costs when doing incremental refresh.
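
The actual solution lives in Power Query M (converting the RangeStart/RangeEnd parameters before filtering, so the predicate can fold down to the partitioned files), but the core idea is just a key conversion. Here is a small Python sketch of that idea; the smart date ID layout and example values are hypothetical.

```python
# Conceptual sketch only: incremental refresh supplies datetime RangeStart /
# RangeEnd boundaries, and the trick is translating them into the numeric
# form the source is actually partitioned on before filtering.
from datetime import datetime

def to_year_key(boundary: datetime) -> int:
    """Year-level key, e.g. 2023, for Parquet/Delta files partitioned by Year."""
    return boundary.year

def to_smart_date_id(boundary: datetime) -> int:
    """Smart date ID as an integer, assuming an MMDDYYYY layout (hypothetical)."""
    return int(boundary.strftime("%m%d%Y"))

range_start = datetime(2023, 1, 1)
range_end = datetime(2024, 1, 1)

# Only partitions whose key falls inside [start, end) need to be read/refreshed.
print(to_year_key(range_start), to_year_key(range_end))  # 2023 2024
print(to_smart_date_id(range_start))                     # 1012023
```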


Data Protection in Azure

Deepthi Goguri enumerates the ways:

Data needs to be protected no matter where it lives, on-prem or in Azure. Data can be protected by using the encryption that Azure provides. What are the types of encryption we have in Azure?

In addition to specific encryption options, Deepthi also provides an overview of Dynamic Data Masking and ledger tables.


Adaptive Protection and More in Microsoft Purview

Victoria Holt summarizes some Microsoft Purview updates:

Yesterday a lot of changes were announced for Microsoft Purview at the Microsoft security event, “Securing Your Data with a Multilayered Defense.” Microsoft Purview is about managing data security risks across hybrid, multi-cloud data estates, using a defense-in-depth strategy to mitigate risk. The recording can be watched at this link.

Read on for a summary of what went down.


How Power BI Chooses Colors for Legends

Allison Kennedy reveals a mystery of the universe:

I’ve just had a wonderful discovery about why Power BI sometimes seems to choose random colors in the legend. 

Typically, the first item in a series will match the first color of your Power BI theme, the second item in the series will match the second color of your Power BI theme, and so on. 

However, this isn’t always the case. I have noticed that sometimes, when I have text category values for my legend, Power BI can assign random colors, seemingly not even part of my theme. Until recently, I just accepted this as a quirk of Power BI and carried on with my report development.

Read on for the answer.


Migrating SQL Server Databases to Aurora

Jamie Wick wounds me:

Our decision to move some (not all) SQL databases to AWS Aurora was partially a cost decision, but also an effort to simplify operational management by diversifying our database platforms. Now that second reason might seem counter-intuitive, but here are some details about the situation and how we came to our decision:

Read on for Jamie’s thoughts and tips to make the migration process easier.


Thoughts on the Data Lakehouse Concept

Teo Lachev shares some thoughts:

There has been a lot of noise surrounding the data lakehouse nowadays, so I felt the urge to chime in. In fact, the famous Guy in a Cube, Patrick LeBlanc, gave a great presentation on this subject to our Atlanta Power BI Group, and you can find the recording here (I have to admit we could have done a better job with the recording quality, but we are still learning in the post-COVID era).

A lot of Teo’s thoughts relate to the data lake portion of the “lakehouse” metaphor and they’re worth reading. One small comment I’ll make is regarding the following statement:

And should you stage the data 1:1 from the source? In some cases, like the aforementioned Get Data First scenario, it might make sense. But in most cases, it would be much more efficient to stage the data in the shape you need it, which may necessitate joining multiple tables at the source (by the way, a relational server is the best place to handle joins).

The biggest benefit to getting data as “raw” as possible comes from the expectation that upstream data is ephemeral. What I mean is that, if you pull data on 2023-02-08, you’ll get some dataset. On 2023-04-08, you won’t be able to re-create that original dataset because your source will have changed, with inserts, updates, and deletes modifying the source.

Therefore, you want the copy of data you land in the Raw/Bronze layer to be as “opinion-free” as possible. This is because, if your data access code has a bug in it, you’ll never be able to re-create the old data. By contrast, if you have that raw data, you can start from there, re-apply your transformations, and replace your downstream results with corrected details. This is similar to the benefit a write-once service like Apache Kafka provides: you get history where there was none before.

So that’s the trade-off on the other side. Yeah, it’s typically more efficient to create queries and operate on the data before it goes into Raw, and you might need to do so for practical reasons, but there’s a risk of that code causing an irrevocable data problem, and the risk goes up as the operations get more complicated.
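
To make that concrete, here is a tiny illustration of my own (not from Teo's post) of the landing-versus-transform split; the column names and business rule are invented.

```python
# Hypothetical sketch: Bronze keeps the source snapshot untouched, Silver is a
# pure function of Bronze, so a buggy transform can be fixed and re-run later
# even though the source system has long since moved on.
import pandas as pd
from datetime import date

def land_raw(source_df: pd.DataFrame, load_date: date) -> pd.DataFrame:
    """Bronze: copy every source column as-is and stamp the load date."""
    raw = source_df.copy()
    raw["_load_date"] = load_date
    return raw

def transform(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Silver: business logic lives here and can be rewritten and re-applied."""
    out = raw_df[raw_df["status"] == "active"].copy()
    out["amount_usd"] = out["amount_cents"] / 100
    return out

source = pd.DataFrame({"status": ["active", "deleted"], "amount_cents": [1250, 900]})
bronze = land_raw(source, date(2023, 2, 8))  # snapshot of the source on that date
silver = transform(bronze)                   # regenerable from bronze at any time
print(silver)
```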
