2023-09-29 – Curated SQL

Parameterizing Databricks Notebooks with Widgets

Published 2023-09-29 by Kevin Feasel

Widgets provide a way to parameterize notebooks in Databricks. If you need to call the same process for different values, you can create widgets to allow you to pass the variable values into the notebook, making your notebook code more reusable. You can then refer to those values throughout the notebook.

Click through to learn more about the four types of widgets and how they work.

Comments closed

Overlaying Lines with Points in Base R

Published 2023-09-29 by Kevin Feasel

Steven Sanderson adds points to those lines:

In this blog post, we’ll explore how to overlay points or lines on a plot using Base R. We’ll use the plot() function to create the initial plot and then show how to overlay points with points() and lines with lines(). We’ll provide several examples, explaining each code block in simple terms, and encourage you to try them out on your own datasets.

Read on to see how. It’s also pretty easy to do in ggplot2 or other visualization libraries.

Comments closed

Creating a Postgres Cluster on AWS with pg_cirrus

Published 2023-09-29 by Kevin Feasel

Salman Ahmed builds a cluster:

pg_cirrus is a simple and automated solution to deploy highly available 3-node PostgreSQL clusters with auto failover. It is built using Ansible and to perform auto failover and load balancing we are using pgpool.

We understand that setting up 3-node HA cluster using pg_cirrus on cloud environment isn’t as simple as setting it up on VMs. In this blog we will guide you in setting up a 3-node HA cluster using pg_cirrus on AWS EC2 instances.

Read on for the step-by-step instructions.

Comments closed

Metadata-Driven Pipelines for Azure Data Factory Loads

Published 2023-09-29 by Kevin Feasel

Marc Bushong doesn’t want to copy and paste:

Developing ETLs/ELTs can be a complex process when you add in business logic, large amounts of data, and the high volume of table data that needs to be moved from source to target. This is especially true in analytical workloads involving Azure SQL when there is a need to either fully reload a table or incrementally update a table. In order to handle the logic to incrementally update a table or fully reload a table in Azure SQL (or Azure Synapse), we will need to create the following assets:

Metadata table in Azure SQL

This will contain the configurations needed to load each table end to end

Metadata driven pipelines

Parent and child pipeline templates that will orchestrate and execute the ETL/ELT end to end

Custom SQL logic for incremental processing

Dynamic SQL to perform the delete and insert based on criteria the user provides in the metadata table

Read on for the demonstration, which reads from one Azure SQL DB into another.

Comments closed

Data Temperature in Microsoft Fabric

Published 2023-09-29 by Kevin Feasel

Marc Lelijveld breaks out the thermometer:

As part of Microsoft Fabric, a new storage mode to connect from Power BI to data in OneLake has been introduced. Direct Lake it makes to possible to use your data from OneLake in Power BI without taking an additional copy of the data. Where Direct Lake promises to deliver the performance of Import-mode with the real-time capabilities of Direct query, it is time to have a closer look how data gets loaded into memory and delving into the concept of data dictionary temperature.

In this blog I will explain when data gets loaded into memory, elaborate on how you can measure the dictionary temperature of your data and the effect of queries on the temperature.

Click through to see what affects this measure and how.

Comments closed

The Risk of Changing MaxDOP

Published 2023-09-29 by Kevin Feasel

Erik Darling recommends caution:

Like in yesterday’s post about Cost Threshold For Parallelism, changing MAXDOP settings will have a universal effect on the workload.

This is true whether you change it at the server level for all databases, or at the database level using a database scoped configuration for a single database.

It is a guardrail to prevent unwanted conditions as a whole, like excessive concurrent parallel queries causing worker thread starvation (THREADPOOL waits), or just pushing CPU to 100% for extended periods of time.

Read on to see what Erik recommends you think about after any MaxDOP change.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30

Day: September 29, 2023

Parameterizing Databricks Notebooks with Widgets

Overlaying Lines with Points in Base R

Creating a Postgres Cluster on AWS with pg_cirrus

Metadata-Driven Pipelines for Azure Data Factory Loads

Data Temperature in Microsoft Fabric

The Risk of Changing MaxDOP