Press "Enter" to skip to content

Curated SQL Posts

Calibrating and Plotting a Time Series with healthyR.ts

Steven Sanderson builds a plot:

In time series analysis, it is common to split the data into training and testing sets to evaluate the accuracy of a model. However, it is important to ensure that the model is calibrated on the training set before evaluating its performance on the testing set. The {healthyR.ts} library provides a function called calibrate_and_plot() that simplifies this process.

Click through for the function’s input parameters and an example of how to use it.

Comments closed

Tips for Kafka Streams Developers

Ludovic Dehon shares some advice:

We built Kestra, an open-source data orchestration and scheduling platform, and we decided to use Kafka as the central datastore to build a scalable architecture. We rely heavily on Kafka Streams for most of our services (the executor and the scheduler) and have made some assumptions on how it handles the workload.

However, Kafka has some restrictions since it is not a database, so we need to deal with the constraints and adapt the code to make it work with Kafka. We will cover topics, such as using the same Kafka topic for source and destination, and creating a custom joiner for Kafka Streams, to ensure high throughput and low latency while adapting to the constraints of Kafka and making it work with Kestra.

Click through for several tips.

Comments closed

Aggregations and Distinct Counts in Power BI

Phil Seamark doesn’t have time to wait for Power BI to count:

This article aims to show how you can speed up distinct count calculations in Power BI using the built-in user-defined aggregations feature. The user-defined aggregation feature in Power BI is designed to work with direct query models and usually gets used for calculations such as SUM, MIN, MAX etc. However, it can also work well for distinct count calculations using the pattern shown in this article.

It’s an interesting partial aggregation approach which works really well when the distinct count is a small percentage of the total.

Comments closed

Synapse and Azure ML Pipelines

Santosh Thomas integrates two Azure products:

As more customers standardize on the Synapse data platform, enabling machine learning workflows through Azure Machine Learning (Azure ML) becomes particularly interesting. This is especially true as more customers look to bring their data engineering and data science practices together and mature capabilities on both sides.

The goal of this blog post is to highlight how Synapse and Azure ML can work well together to deliver key insights. This is motivated by a scenario where a customer modernized their data platform on Azure Synapse but was looking to improve their data science practices through Azure ML. The focus of this blog is to expose existing functionality, and it is not a “hardened” solution with security or other cloud best practice implementations. The workflow steps also assume some level of comfort with Python and working with the Azure Python SDKs.

There was a time in which Microsoft wanted us to remain in Synapse for machine learning tasks, but that time is gone: the emphasis is definitely to do machine learning tasks in Azure ML, regardless of where the data lives…unless there’s a Spark job involved, in which case things get all weird again.

Comments closed

Kafka Control and Data Planes

Sanjay Garde explains how the architecture of Apache Kafka solutions has expanded over time:

With the advent of service mesh and containerized applications, the idea of the control and data plane has become popular. A part of your application infrastructure, such as a proxy or sidecar, is dedicated to aspects such controlling traffic, access, governance, security, and monitoring and is referred to as the control plane. Another part of your application infrastructure that is used purely for processing your business transactions is referred to as the data plane.

Read on to see how the concept works at an architectural level.

Comments closed

Principles of Synapse Security

Liliam Leme provides an overview of security options in Azure Synapse Analytics:

This blog post will provide an overview of the Synapse security environment focused on Dedicated SQL Pool, Serverless SQL Pool, and Spark.

Security has many layers and frequently it will determine how you build your process. I start this post by reviewing several important security considerations which you can later apply to your Synapse environment. 

This is a fairly lengthy post and it still only covers a moderate amount of what you’d want to do for Azure Synapse Analytics. This is the downside to having a complex interplay of several products: there’s a lot to secure and a lot to think about along the way.

Comments closed

Building an Event Calendar from T-SQL and CSS

Aaron Bertrand builds some HTML:

In my last tip, I showed how to use T-SQL to generate HTML for fancy calendar visuals overlaid with event data from another table. As an extension of that tip, let’s now look at simplifying parts of that query by caching the date information in a calendar table to streamline the outer queries and avoid complications caused by different DATEFIRST settings.

This is the follow-on from a prior post (linked in the lede here), so it would make sense to read that one first if you haven’t already.

Comments closed

ADX Dashboards Now Generally Available

Michal Bar provides an overview of Azure Data Explorer functionality now generally available :

Each ADX dashboard is a collection of tiles, optionally organized in pages, where each tile has an underlying query and a visual representation. Using the web UI, you can natively export Kusto Query Language (KQL) queries to a dashboard as visuals and later modify their underlying queries and visual formatting as needed. In addition to ease of data exploration, this fully integrated Azure Data Explorer dashboard experience provides improved query and visualization performance.

Read on to learn more.

Comments closed

Object Ownership in Postgres

Ryan Booz divvies up ownership rights:

In the first security article, PostgreSQL Basics: Roles and Privileges, I discussed how roles (users and groups) are created and managed in PostgreSQL Depending on your background with permissions, particularly in other database products, some of those nuances in how permissions work may have been surprising.

Understanding how roles and privileges work in Postgres is key to understanding the next, and often more confusing part of security, object ownership. Although there are many privileges that can be assigned in Postgres, object ownership comes with a specific level of privilege that cannot be transferred to other roles. Understanding the implications of this is essential to the management of your database schema, and access to the objects it contains, over time.

Click through to learn more about how Postgres handles object ownership and defines who gets access to which objects.

Comments closed

Matrix Multiplication in R with DuckDB and SQLite

Karsten Weinert compares two databases:

On my laptop with 16 GB RAM, I would like to perform a matrix-vector multiplication with a sparse matrix of around 10 million columns and 2500 rows. The matrix has approximately only 2% non-zero entries, but this are still 500 million numbers and the column/row information, too large to work comfortably in-memory.

A while ago, I tried using sqlite for this task. It kind of worked, but was too slow to be useful. This weekend, I revisited the problem and tried using duckdb.

Read on for the results. I’ve heard enough positives about DuckDB over the past few weeks that it makes me want to try it out. H/T R-Bloggers.

Comments closed