2023-08-10 – Curated SQL

Creating a Simple Date Dimension in Databricks

Published 2023-08-10 by Kevin Feasel

A date dimension is extremely useful and is required by most BI applications. This kind of dimension has a key of time level (day, month, etc.), and attributes that describe it such as year, month, etc. In your BI model, you join this dimension to facts on their date fields, to aggregate from day level to week, month, and year.

In this post, I will demonstrate how to create a date dimension on Azure Databricks using Python. A link to the complete Databricks notebook is at the end of the post.

Check out the code, as well as the explanation, in that post.
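For a rough feel of what a date dimension contains, here is a minimal sketch in plain Python. It is not the notebook from the linked post, and it does not use Spark or any Databricks API. The column names (`date_key`, `iso_week`, `is_weekend`, and so on), the helper functions, and the sample facts are all assumptions made for illustration.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row (a dict) per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        iso_year, iso_week, iso_weekday = current.isocalendar()
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # integer key such as 20230810
            "date": current,
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day": current.day,
            "day_of_week": iso_weekday,  # ISO numbering: Monday=1 ... Sunday=7
            "iso_week": iso_week,
            "is_weekend": iso_weekday >= 6,
        })
        current += timedelta(days=1)
    return rows


def totals_by_month(facts, dim_rows):
    """Join (date_key, amount) facts to the dimension and sum per (year, month)."""
    by_key = {row["date_key"]: row for row in dim_rows}
    totals = {}
    for date_key, amount in facts:
        d = by_key[date_key]
        bucket = (d["year"], d["month"])
        totals[bucket] = totals.get(bucket, 0) + amount
    return totals


dim = build_date_dimension(date(2023, 1, 1), date(2023, 12, 31))

# Hypothetical day-level facts, keyed the same way as the dimension.
facts = [(20230809, 100), (20230810, 50), (20230901, 25)]
monthly = totals_by_month(facts, dim)
```

The join on `date_key` is what lets day-level facts roll up to month level; swapping the bucket for `d["iso_week"]` or `d["year"]` gives the week and year rollups in the same way.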


Managing Plot Parameters in R

Published 2023-08-10 by Kevin Feasel

Steven Sanderson switches up a visual:

When it comes to data visualization in R, the par() function is an indispensable tool that often goes overlooked. This function allows you to control various graphical parameters, unleashing a world of customization possibilities for your plots. In this blog post, we’ll demystify the par() function, break down its syntax, and provide you with hands-on examples to help you create stunning visualizations.

Click through to check it out. My loyalties lie with ggplot2 for static visual development in R, but it’s definitely not the only way to get images to look the way you want them.


Combining Cosmos DB and Azure Search

Published 2023-08-10 by Kevin Feasel

Hasan Savran does some looking:

In my previous post, I discussed the process of establishing a Free-text search for Azure Cosmos DB. Towards the end, I demonstrated how to carry out a free-text search using the Azure Portal. Now, I will guide you on how to perform this search using code. To perform this search by code, I created a basic console application and added Azure.Search.Documents and Microsoft.Azure.Cosmos.

Click through for that demonstration.


From Join to Lookup in KQL on Power BI

Published 2023-08-10 by Kevin Feasel

Dany Hoter gives us a workaround:

Many users who try ADX in direct query mode encounter errors right away.

The errors complain about lack of memory.

If the tables are small enough, it may work but still performance will not be as advertised on TV.

The reason in most cases is the behavior of joins in ADX as they are created by PBI.

In this article I’ll show different approaches to joining tables as used by PBI for related tables or as can be expressed in KQL in general.

I created a special table in the help cluster with 31 million rows that is big enough to demonstrate the performance differences between the variations.

Read the whole thing. This one’s a little surprising to me.


Data Lake Serving Layers

Published 2023-08-10 by Kevin Feasel

James Serra has layers, like an onion:

Data lakes typically have three layers: raw, cleaned, and presentation (also called bronze, silver, and gold if using the medallion architecture popularized by Databricks). I talk about this in my prior blog post on Data lake architecture. Many times, companies will create a fourth layer outside of the data lake that I call the relational serving layer. I’ve been having conversations recently with companies about the need for another type of fourth layer, which I will call the physical serving layer. In this blog post I’ll discuss the relational serving layer and the physical serving layer.

Read on to learn more about these.


Freshness Labels on Content

Published 2023-08-10 by Kevin Feasel

Steve Jones does some noodling:

I chose the title slightly to poke at Stack Overflow (SO), but the same take expressed in this tweet could be said about SQL Server Central. It’s not quite the same as anyone can answer questions on SQL Server Central.

The tweet is a (long) hot take from Jerry Nixon, a C# developer and MS evangelist in Denver. Essentially he says that a lot of the SO answers are wrong, especially as the software and languages change. Old answers are upvoted, and remain at the top of the list, even as newer answers might be better. People don’t like the behavior on SO of moderators and people who post, which is something we’ve tried to avoid or limit here at SQL Server Central. We want there to be professional discussions. SO also doesn’t allow much discussion or nuance in the questions or answers.

This isn’t just a SO problem or an SSC one.

Read the whole thing. This is a huge problem with search engines today, and there’s a hacky solution for it. Going back to the original PageRank algorithm that Google used, your rank on the search results list was heavily tied to how many other pages linked back to you. Older pages tend to have more linkbacks because they’ve been around longer, and so there’s a built-in bias toward older content. Google, in particular, has done a lot to work around this problem, but there’s a real issue with timeliness in articles: sometimes, you want the brand new information (like, say, product recommendations); other times, you want older or even the original information (such as if you’re researching historical activities). The problem is that there’s no good way to indicate this to the search engines we have, so the hacky solution is for content creators to create sites like “The May 2023 Guide to Blahblahblah” and for search engine users to look for terms like “2023 blahblahblah” so they can avoid all of the outdated 2022 and 2021 blahblahblah discussions.
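To make the linkback bias concrete, here is a minimal sketch of the textbook PageRank iteration in Python. It is the simplified published form of the idea, not whatever Google runs today, and the page names and link graph are hypothetical: an older guide that three blogs link to, and a newer guide that only one of them has picked up so far.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated iteration.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: the older guide has collected more inbound links.
links = {
    "old_guide": [],
    "new_guide": [],
    "blog_a": ["old_guide"],
    "blog_b": ["old_guide"],
    "blog_c": ["old_guide", "new_guide"],
}
ranks = pagerank(links)
```

In this toy graph `old_guide` ends up with a higher score than `new_guide` purely because more pages point at it, regardless of which guide is more current.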

There’s also a story in here around keeping things up to date. Some people are good about that—they’ll go back and update years-old blog posts based on what’s new and happening. I am not one of those people.

