February 2023 – Page 9

Thoughts on the Data Lakehouse Concept

Published 2023-02-09 by Kevin Feasel

There has been a lot of noise surrounding a data lakehouse nowadays, so I felt the urge to chime in. In fact, the famous guy in cube, Patrick LeBlanc, gave a great presentation on this subject to our Atlanta Power BI Group and you can find the recording here (I have to admit we could have done better job with the recording quality, but we are still learning in the post-COVID era).

A lot of Teo’s thoughts relate to the data lake portion of the “lakehouse” metaphor and they’re worth reading. One small comment I’ll make is regarding the following statement:

And should you stage the data 1:1 from the source? In some cases, like the Get Data First aforementioned scenario, it might make sense. But in most cases, it would be much more efficient to stage the data in the shape you need it, which may necessitate joining multiple tables at the source (by the way, a relational server is the best place to handle joins).

The biggest benefit to getting data as “raw” as possible comes from the expectation that upstream data is ephemeral. What I mean is that, if you pull data on 2023-02-08, you’ll get some dataset. On 2023-04-08, you won’t be able to re-create that original dataset because your source will have changed, with inserts, updates, and deletes modifying the source.

Therefore, you want your copy of data into the Raw/Bronze layer to be as “opinion-free” as possible. This is because, if your data access code has a bug in it, you’ll never be able to re-create the old data. By contrast, if you have that raw data, you can start from there, re-apply your transformations, and replace your downstream results with corrected details. This is similar to the way a write-once service like Apache Kafka can benefit: you get history where there was none before.

So that’s the trade-off on the other side. Yeah, it’s typically more efficient to create queries and operate on the data before it goes into Raw, and you might need to do so for practical reasons, but there’s a risk of that code causing an irrevocable data problem and the risk goes up as you get more complicated operations.

Comments closed

Last Refresh Date Measure in Power BI

Published 2023-02-09 by Kevin Feasel

Paul Turley keeps it fresh:

Business users often need to know how fresh the data is that they see on a Power BI report. This requirement can be interpreted to mean either “when was the last date & time that the source data was updated?” or “when was the last date & time that the data model was refreshed?”

Read on for Paul’s current solution to the problem of knowing the last time this data was refreshed.

Comments closed

Azure Synapse Analytics January 2023 Updates

Published 2023-02-09 by Kevin Feasel

Ryan Majidimehr shares an overview of what the Synapse team has been working on:

ADX contains native support for detecting anomalies over multiple time series by using the function series_decompose_anomalies(). This function can analyze thousands of time series in seconds, enabling near real time monitoring solutions and workflows based on ADX. Univariate analysis is simpler, faster, and easily scalable and is applicable to most real-life scenarios. However, there are some cases where it might miss anomalies that can only be detected by analyzing multiple metrics at the same time.

For some scenarios, there might be a need for a true multivariate model, jointly analyzing multiple metrics. This can be achieved now in ADX using the new Python-based multivariate anomaly detection functions.

The themes for this month are Spark, Data Exploration (via Kusto), and data integration.

Comments closed

Thoughts on Auditing in Azure SQL DB

Published 2023-02-09 by Kevin Feasel

Michael Bourgon shares some thoughts on Azure SQL Database Auditing:

A couple of notes when using Azure SQL DB Auditing

1) this is actually kinda cool, overall. I plan on mining this in the future for code changes, etc.

2) You can filter it! predicateexpression is the term you’re looking for, available via the powershell module

Read on for more thoughts on the topic.

Comments closed

Approximation with the Mediant

Published 2023-02-08 by Kevin Feasel

John Cook didn’t make a typo:

Suppose you are trying to approximate some number x and you’ve got it sandwiched between two rational numbers:

a/b < x < c/d.

Now you’d like a better approximation. What would you do?

The obvious approach would be to take the average of a/b and c/d. That’s fine, except it could be a fair amount of work if you’re doing this in your head.

Read on for a separate approach taking the mediant (not median) of the two fractions.

Comments closed

Plotting Tables with ggplot2

Published 2023-02-08 by Kevin Feasel

Sebastian Sauer plots some tables for us.

The post is a little too terse to grab a lede from, but it does show a good way of generating a formatted image of a table using the ggplot2 library in R.

Comments closed

Solving a Bank Robbery with KQL

Published 2023-02-08 by Kevin Feasel

Paul Hernandez gets Inspector Cluseau’d:

The Kusto Detective Agency was created last year (2022) to let people learn KQL solving funny puzzles at the same time. For me it has been a kind of scape room game, where you can exercise your solving problem capabilities and learn ADX at the same time. If you want to participate just follow this link

Click through for the scenario which got Paul.

Comments closed

Splitting Strings to a Table in Snowflake

Published 2023-02-08 by Kevin Feasel

Kevin Wilkie puts on the flannel and grabs his database lumberjack axe to split some strings:

But, sometimes, you want a small list of data inherent to a query in SnowFlake. And that’s what I want to talk about today.

In SQL Server, you would create a temp table and then insert the data into it. But in Snowflake, there may be a better / easier way.

Let’s use the function SPLIT_TO_TABLE. Shockingly, it does what’s on the label – it split data and puts it into a table.

Click through for an example. Also check out the Snowflake documentation, where they make use of the lateral operator (the ANSI version of APPLY()) to generate results for multiple strings and make use of the SEQ column.

Comments closed

FILTER vs CASE in Postgres

Published 2023-02-08 by Kevin Feasel

Lukas Eder hits on an interesting data point:

I’ve found an interesting question on Twitter, recently. Is there any performance impact of using FILTER in SQL (PostgreSQL, specifically), or is it just syntax sugar for a CASE expression in an aggregate function?

Click through for the answer and your daily reminder that SQL variants aren’t pure fourth generation languages—if they were, the optimizer would take all possible constructs of a given desired operation and generate the same execution plan for all of them.

Comments closed

Deploying to Dedicated SQL Pools via GitHub Action

Published 2023-02-08 by Kevin Feasel

Kevin Chant deploys some database objects:

In this post I want to cover using Azure SQL Deploy v2 for dedicated SQL Pool deployments using GitHub Actions. Which is the GitHub Action that is also known as sql-action.

This post provides an overview of what is new and provides plenty of links along the way.

Check out what’s new and the way Kevin handles the deployment.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28

Month: February 2023