Press "Enter" to skip to content

July 15, 2024

Transferring Linear Model Coefficients

Nina Zumel performs a swap:

A quick glance through the scikit-learn documentation on linear models, or the CRAN task view on Mixed, Multilevel, and Hierarchical Models in R, reveals a number of different procedures for fitting models with linear structure. Each of these procedures meets different needs and constraints, and some of them can be computationally intensive. But in the end, they all have the same underlying structure: the outcome is modelled as a linear combination of input features.

But the existence of so many different algorithms, and their associated software, can obscure the fact that two models fit in different ways don't have to be run in different ways. The fitting implementation and the deployment implementation can be distinct. In this note, we'll talk about transferring the coefficients of a linear model to a fresh model, without a full retraining.

I had a similar problem about 18 months ago, though a much easier one than Nina describes, as I did have access to the original data and simply needed to build a linear regression in Python that exactly matched the one they developed in R. It turns out that's not as easy as you might think: the two languages have different default assumptions that make the results similar but not identical, and piecing all of this together took a bit of sleuthing.
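
As a concrete illustration of the coefficient-transfer idea, here is a minimal sketch in Python. It assumes you have already extracted coefficients from the original fit (an R lm() model, say) and transplants them into a fresh scikit-learn estimator without retraining; the coefficient values below are placeholders, not anything from Nina's post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Coefficients exported from the original fit (placeholder values).
coefs = np.array([1.7, -0.6, 3.2])  # one coefficient per input feature
intercept = 0.25

# Build a fresh, unfitted model and transplant the parameters.
model = LinearRegression()
model.coef_ = coefs
model.intercept_ = intercept
model.n_features_in_ = coefs.shape[0]  # keeps input validation happy

# The transplanted model predicts like any fitted estimator:
# prediction = X @ coef_ + intercept_
X_new = np.array([[1.0, 2.0, 0.5]])
print(model.predict(X_new))  # [2.35]
```

This works because prediction only needs the linear form, not anything about how the coefficients were estimated, which is exactly Nina's point about separating fitting from deployment.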

Generating a Schedule in R

Tomaz Kastrun builds timetables:

Each meeting slot is represented as a block (lasting an arbitrary number of hours, mostly from 1 to 4). Conducting each block requires a pair of departments, a room, and a time slot. It is also known in advance which groups attend which class, and all rooms are the same size.

Input data: all department names, room names, and time slots.
Output data: rooms and time slots for each pair of departments in the schedule.

Click through for the code and explanation.
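
Tomaz works in R; just to make the shape of the problem concrete, here is a naive greedy sketch in Python with made-up department, room, and slot names. It assigns each pair of departments to a (room, time slot) cell while honoring the constraint that a department cannot be in two rooms during the same slot. A real timetable with block lengths and group constraints would need backtracking or a constraint solver, which is where Tomaz's approach comes in.

```python
from itertools import combinations, product

departments = ["Dept A", "Dept B", "Dept C", "Dept D"]  # hypothetical names
rooms = ["Room 1", "Room 2"]
time_slots = ["Mon 09:00", "Mon 10:00", "Mon 11:00"]

# Every pair of departments needs one (room, time slot) assignment.
pairs = list(combinations(departments, 2))

# Greedy assignment: walk the (time slot, room) grid and hand each cell
# to the first unscheduled pair whose departments are free in that slot.
busy = {slot: set() for slot in time_slots}
schedule = []
for slot, room in product(time_slots, rooms):
    for pair in pairs:
        if not set(pair) & busy[slot]:
            schedule.append((pair, room, slot))
            busy[slot].update(pair)
            pairs.remove(pair)
            break

for (a, b), room, slot in schedule:
    print(f"{a} + {b}: {room} at {slot}")
```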

The Cost of Maintaining Extended Statistics in Postgres

Andrew Lepikhov breaks out the stopwatch:

In the previous post, I passionately advocated for integrating extended statistics and, moreover, creating them automatically. But what if it is too computationally demanding to keep statistics fresh?

This time, I will roll up my sleeves, get into the nitty-gritty and shed light on the burden extended statistics put on the digital shoulders of the database instance. Let’s set aside the cost of using this type of statistics during planning and focus on one aspect – how much time we will spend in an ANALYZE command execution.

Read the whole thing if you’re a Postgres admin or developer.
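
If you want to eyeball that overhead on your own instance, a rough timing harness like the one below will do, assuming a reachable Postgres database and the psycopg2 package; the table, columns, and statistics name are made up for illustration, not taken from Andrew's post.

```python
import time
import psycopg2

conn = psycopg2.connect("dbname=testdb")  # hypothetical connection string
conn.autocommit = True
cur = conn.cursor()

# A toy table with perfectly correlated columns -- the classic case
# where extended statistics pay off at planning time.
cur.execute("CREATE TABLE IF NOT EXISTS t (a int, b int)")
cur.execute("INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 500000) i")

def time_analyze(label):
    start = time.perf_counter()
    cur.execute("ANALYZE t")
    print(f"{label}: {time.perf_counter() - start:.3f}s")

time_analyze("plain ANALYZE")

# Add extended statistics, then measure ANALYZE again to see the delta.
cur.execute("CREATE STATISTICS s_ab (ndistinct, dependencies) ON a, b FROM t")
time_analyze("ANALYZE with extended stats")
```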

An Overview of Normal Forms

Daniel Calbimonte talks normalization:

Various levels of normalization in SQL can be used to reduce data redundancy and have a better-structured relational data model. This tutorial looks at these various levels with explanations and examples in Microsoft SQL Server for beginners.

I disagree with part of Daniel’s explanation of 1NF: I believe that the idea of atomicity, as Daniel defines it, is not part of 1NF. I’m basing this off of CJ Date’s definition of first normal form:

Given relvar R with heading H containing attributes A1…An of types T1…Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.

All this says is that we have a single value per attribute in a tuple. “LeBron James, Lakers” and “Stephen Curry, Warriors” are perfectly reasonable values for attributes in first normal form. In Database Design and Relational Theory, Date spends a few pages covering the idea of atomicity and how there’s no good explanation for what, exactly, “atomic” means. Even in Daniel’s example, you could break down player and coach names further, not only into first and last names, but also subsets of characters within those names, like syllables. The closest thing I have for atomicity is the idea that something is atomic when it is at the lowest level given a particular set of data requirements. But that’s not a mathematical rule like the rules of normalization. It’s a business rule, and necessarily fuzzier and subjective.
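
To put that argument in code terms, here is a toy illustration (mine, not Daniel's):

```python
from typing import NamedTuple

# Under Date's definition, 1NF demands exactly one value of the declared
# type per attribute per tuple -- and nothing more.
class Roster(NamedTuple):
    player_team: str  # one value of type str, however composite it looks

rows = [
    Roster("LeBron James, Lakers"),
    Roster("Stephen Curry, Warriors"),
]

# Each tuple has a single str value for player_team, so this relation
# satisfies Date's 1NF. Whether to decompose further (player vs. team?
# first vs. last name? syllables?) is a business decision about data
# requirements, not a consequence of the normal form itself.
```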

That said, I like the rest of Daniel's list and appreciate that it goes all the way to 5th normal form.

Tips for Using Azure Backup for SQL Server

Anna Hoffman, et al., share some tips and tricks:

We recently worked with a customer that migrated their Windows and SQL Servers to Azure and wanted to use Azure Backup for a consistent enterprise backup experience. The SQL Servers had multiple databases of varying sizes, some of them multi-terabyte. A single Azure Backup vault was deployed using a policy that was distributed to all the SQL Servers. During the migration process, the customer observed issues with the quality of the backups and poor virtual machine performance while the backups were running. We worked through the issues by reviewing the best practices, modifying the Azure Backup configuration, and changing the virtual machine SKU. For this specific example, the customer needed to change their SKU from Standard_E8bds_v5 to Standard_E16bds_v5 to support the additional IOPS and throughput required for the backups. They used Premium SSD v1, and the configuration met the IOPS and throughput requirements.

In this post, we share some of the techniques we used to identify and resolve the performance issues that were observed. 

Read on to learn more about how Azure Backup works and troubleshooting mechanisms.

Working with GraphQL in Microsoft Fabric

Stepan Resl takes us through what’s available today:

GraphQL is an alternative to REST APIs and enables users to fetch data from multiple sources using a single query. Compared to a REST API, GraphQL is much more flexible and allows users to retrieve only the data they need, reducing the amount of data transferred between the client and server. It also uses a single endpoint, reducing the number of requests made to the server. It is a platform- and programming language-independent specification, meaning it can be used with any language and on any platform.

GraphQL is defined by an API schema written in the GraphQL schema definition language. Each schema specifies the types of data that users can request or modify, and the relationships between these types. The term “resolver” is often mentioned in relation to GraphQL. It refers to a function or functions responsible for fetching data for a specific field in the schema and provides instructions for converting the GraphQL operation into data.
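
To see the "single endpoint, only the fields you ask for" idea in action, here is a short sketch in Python; the endpoint URL and the schema (a customers collection with customerId and name fields) are hypothetical stand-ins, not Fabric's actual API surface.

```python
import requests

# Hypothetical GraphQL endpoint; in Fabric, the real URL comes from
# your API for GraphQL item.
ENDPOINT = "https://example.com/graphql"

# One POST to one endpoint. The query names exactly the fields we want,
# so the server returns nothing else.
query = """
query {
  customers(first: 5) {
    items {
      customerId
      name
    }
  }
}
"""

resp = requests.post(ENDPOINT, json={"query": query})
resp.raise_for_status()
print(resp.json()["data"])
```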

As a quick reminder for the data-minded: GraphQL and graph databases are orthogonal to one another.
