Curated SQL – Page 274 – A Fine Slice Of SQL Server

Transferring Linear Model Coefficients

Published 2024-07-15 by Kevin Feasel

A quick glance through the scikit-learn documentation on linear models, or the CRAN task view on Mixed, Multilevel, and Hierarchical Models in R, reveals a number of different procedures for fitting models with linear structure. Each of these procedures meets different needs and constraints, and some of them can be computationally intensive. But in the end, they all have the same underlying structure: outcome is modelled as a linear combination of input features.

But the existence of so many different algorithms, and their associated software, can obscure the fact that just because two models were fit differently, they don’t have to be run differently. The fitting implementation and the deployment implementation can be distinct. In this note, we’ll talk about transferring the coefficients of a linear model to a fresh model, without a full retraining.

I had a similar problem about 18 months ago, though much easier than the one Nina describes, as I did have access to the original data and simply needed to build a linear regression in Python that matched exactly the one they developed in R. Turns out that’s not as easy to do as you might think: the different languages have different default assumptions that make the results similar but not the same, and piecing all of this together took a bit of sleuthing.
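
To make the "fit in one place, run in another" idea concrete, here is a minimal sketch in plain Python. It is not the method from the linked post: the `LinearModel` class and the `exported` coefficient values are hypothetical, standing in for numbers you would copy out of a model fit elsewhere (for example, in R). The only assumption is the one the quote states, that the outcome is a linear combination of the input features.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LinearModel:
    """Deployment-side linear model: an intercept plus named coefficients."""

    intercept: float
    coefficients: Dict[str, float]

    def predict(self, row: Dict[str, float]) -> float:
        # Outcome = intercept + weighted sum of the input features.
        return self.intercept + sum(
            weight * row[name] for name, weight in self.coefficients.items()
        )


# Hypothetical coefficients, as if exported from a model fit in another tool.
exported = {"(Intercept)": 1.5, "x1": 2.0, "x2": -0.5}

# Build a fresh model from the exported numbers; no training data required.
model = LinearModel(
    intercept=exported["(Intercept)"],
    coefficients={k: v for k, v in exported.items() if k != "(Intercept)"},
)

prediction = model.predict({"x1": 3.0, "x2": 4.0})
print(prediction)
```

The catch is that the features fed to `predict` must be encoded and scaled the same way they were at fitting time, or the transferred coefficients will not mean the same thing.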

Generating a Schedule in R

Published 2024-07-15 by Kevin Feasel

Tomaz Kastrun builds timetables:

Each meeting slot is represented as a block (lasting an arbitrary number of hours, mostly from 1 to 4). Conducting every block requires a pair of departments, a room, and a time-slot. It is also known in advance which groups attend which class, and all rooms are the same size.

Input data are all department names, room names, and time-slots.
Output data are rooms and time-slots for each pair of departments in a time-schedule.

Click through for the code and explanation.
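
For a feel of the problem shape, here is a rough greedy sketch in Python. It is not Tomaz's R code and makes simplifying assumptions: every block lasts exactly one time-slot, every pair of departments meets once, and the function name `build_schedule` and the sample inputs are invented for illustration.

```python
from itertools import combinations, product
from typing import Dict, List, Tuple


def build_schedule(
    departments: List[str], rooms: List[str], slots: List[str]
) -> Dict[Tuple[str, str], Tuple[str, str]]:
    """Greedily assign each pair of departments a (slot, room) without clashes."""
    dept_busy = set()  # (department, slot) pairs already taken
    room_busy = set()  # (room, slot) pairs already taken
    schedule = {}
    for a, b in combinations(departments, 2):
        for slot, room in product(slots, rooms):
            # Skip the slot if either department is already in a meeting.
            if (a, slot) in dept_busy or (b, slot) in dept_busy:
                continue
            # Skip the room if it is already booked for this slot.
            if (room, slot) in room_busy:
                continue
            schedule[(a, b)] = (slot, room)
            dept_busy.update({(a, slot), (b, slot)})
            room_busy.add((room, slot))
            break
        else:
            raise ValueError(f"No free slot and room for {a} and {b}")
    return schedule


schedule = build_schedule(
    departments=["Math", "Physics", "Biology"],
    rooms=["R1", "R2"],
    slots=["Mon 09:00", "Mon 10:00", "Mon 11:00"],
)
for pair, (slot, room) in schedule.items():
    print(pair, slot, room)
```

A greedy pass like this can fail on inputs where a smarter ordering would succeed, so treat it as a starting point and not as a solver.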

The Cost of Maintaining Extended Statistics in Postgres

Published 2024-07-15 by Kevin Feasel

Andrei Lepikhov breaks out the stopwatch:

In the previous post, I passionately advocated for integrating extended statistics and, moreover, creating them automatically. But what if it is too computationally demanding to keep statistics fresh?

This time, I will roll up my sleeves, get into the nitty-gritty and shed light on the burden extended statistics put on the digital shoulders of the database instance. Let’s set aside the cost of using this type of statistics during planning and focus on one aspect – how much time we will spend in an ANALYZE command execution.

Read the whole thing if you’re a Postgres admin or developer.

An Overview of Normal Forms

Published 2024-07-15 by Kevin Feasel

Daniel Calbimonte talks normalization:

Various levels of normalization in SQL can be used to reduce data redundancy and have a better-structured relational data model. This tutorial looks at these various levels with explanations and examples in Microsoft SQL Server for beginners.

I disagree with part of Daniel’s explanation of 1NF: I believe that the idea of atomicity, as Daniel defines it, is not part of 1NF. I’m basing this off of CJ Date’s definition of first normal form:

Given relvar R with heading H containing attributes A1…An of types T1…Tn, all tuples follow heading H and have one value of type Ti for attribute Ai.

All this says is that we have a single value per attribute in a tuple. “LeBron James, Lakers” and “Stephen Curry, Warriors” are perfectly reasonable values for attributes in first normal form. In Database Design and Relational Theory, Date spends a few pages covering the idea of atomicity and how there’s no good explanation for what, exactly, “atomic” means. Even in Daniel’s example, you could break down player and coach names further, not only into first and last names, but also subsets of characters within those names, like syllables. The closest thing I have for atomicity is the idea that something is atomic when it is at the lowest level given a particular set of data requirements. But that’s not a mathematical rule like the rules of normalization. It’s a business rule, and necessarily fuzzier and subjective.

That said, I like the rest of Daniel’s list and appreciate going to 5th normal form.

Tips for Using Azure Backup for SQL Server

Published 2024-07-15 by Kevin Feasel

Anna Hoffman, et al., share some tips and tricks:

We recently worked with a customer that migrated their Windows and SQL Servers to Azure that wanted to use Azure Backup for a consistent enterprise backup experience. The SQL Servers had multiple databases of varying sizes, some that were multi-terabyte. A single Azure Backup vault was deployed using a policy that was distributed to all the SQL Servers. During the migration process, the customer observed issues with the quality of the backups and poor virtual machine performance while the backups were running. We worked through the issues by reviewing the best practices, modifying the Azure Backup configuration, and changing the virtual machine SKU. For this specific example, the customer needed to change their SKU from Standard_E8bds_v5 to Standard_E16bds_v5 to support the additional IOPS and throughput required for the backups. They used premium SSD v1 and the configuration met the IOPS and throughput requirements.

In this post, we share some of the techniques we used to identify and resolve the performance issues that were observed.

Read on to learn more about how Azure Backup works and troubleshooting mechanisms.

Working with GraphQL in Microsoft Fabric

Published 2024-07-15 by Kevin Feasel

Stepan Resl takes us through what’s available today:

It is an alternative to REST API and enables users to fetch data from multiple sources using a single query. Compared to REST API, GraphQL is much more flexible and allows users to retrieve only the data they need, reducing the amount of data transferred between the client and server. It also uses a single endpoint, reducing the number of requests made to the server. It is a platform and programming language-independent specification, meaning it can be used with any language and on any platform.

GraphQL is defined by an API schema written in the GraphQL schema definition language. Each schema specifies the types of data that users can request or modify, and the relationships between these types. The term “resolver” is often mentioned in relation to GraphQL. It refers to a function or functions responsible for fetching data for a specific field in the schema and provides instructions for converting the GraphQL operation into data.

As a quick reminder for the data-minded: GraphQL and graph databases are orthogonal to one another.
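
The resolver idea from the quote can be sketched in a few lines of plain Python. This is a toy, not a GraphQL implementation and not the Fabric API: there is no query parsing or schema validation, and the `CUSTOMERS` data, the `RESOLVERS` table, and `resolve_customer` are all hypothetical names made up for the illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-memory data source standing in for a real backend.
CUSTOMERS = {
    1: {"id": 1, "name": "Ada", "city": "London", "orders": [101, 102]},
}

# One resolver per field: each function knows how to fetch that field's value.
RESOLVERS: Dict[str, Callable[[dict], Any]] = {
    "id": lambda customer: customer["id"],
    "name": lambda customer: customer["name"],
    "city": lambda customer: customer["city"],
    "orderCount": lambda customer: len(customer["orders"]),
}


def resolve_customer(customer_id: int, fields: List[str]) -> Dict[str, Any]:
    """Run only the resolvers for the requested fields."""
    customer = CUSTOMERS[customer_id]
    return {field: RESOLVERS[field](customer) for field in fields}


# The caller asks for two fields and gets back exactly those two.
result = resolve_customer(1, ["name", "orderCount"])
print(result)
```

The point of the sketch is the shape: the client names the fields it wants, and only the matching resolvers run, which is where the reduced data transfer comes from.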

Test Isolation with Kafka

Published 2024-07-12 by Kevin Feasel

Anton Belyaev builds some tests:

The experience of running Kafka in test scenarios has reached a high level of convenience thanks to the use of Testcontainers and enhanced support in Spring Boot 3.1 with the @ServiceConnection annotation. However, writing and maintaining integration tests with Kafka remains a challenge. This article describes an approach that significantly simplifies the testing process by ensuring test isolation and providing a set of tools to achieve this goal. With the successful implementation of isolation, Kafka tests can be organized in such a way that at the stage of result verification, there is full access to all messages that have arisen during the test, thereby avoiding the need for forced waiting methods such as Thread.sleep().

This method is suitable for use with Testcontainers, Embedded Kafka, or other methods of running the Kafka service (e.g., a local instance).

Click through for that approach.

The Framework Laptop and Right to Repair

Published 2024-07-12 by Kevin Feasel

Heather Joslyn summarizes an interview:

Chances are, if you’ve lived through a few innovation cycles, you’ve got too many old computers — and their cables — cluttering your house. Do you think that if you had the right to repair your devices, to swap out obsolete components for more performant ones, you wouldn’t keep piling up castoff electronics?

So does Matt Hartley, guest on this On the Road episode of The New Stack Makers, recorded at Open Source Summit North America in April.

This is a bit out of left field for Curated SQL content, but to be fair, when has that ever stopped me? I’ve owned two Framework laptops (one of which is my daily driver and the other I gave away when it stopped being my daily driver) and really like the company because of its repair-friendly ethos, making parts and schematics available—as was the norm for companies until recently. Part of owning a thing is having the ability to maintain and repair it.

Building a Full-Stack App with Kafka and Node.js

Published 2024-07-12 by Kevin Feasel

Lucia Cerchie builds an application:

A well-known debate: tabs or spaces? Sure, we could set up a Google Form to collect this data, but where’s the fun in that? Let’s settle the debate, Kafka-style. We’ll use the new confluent-kafka-javascript client (not in general availability yet) to build an app that produces the current state of the vote counts to a Kafka topic and consumes from that same topic to surface them to a JavaScript frontend.

Why are we using this client in particular? It comes from Confluent and is intended for use with Apache Kafka® and Confluent Platform. It’s compatible with Confluent’s cloud offering as well. It builds on concepts from the two most popular Kafka JavaScript client libraries: KafkaJS and node-rdkafka. The functionality is based on node-rdkafka; however, it also provides a way to interface with the library via methods similar to those in KafkaJS due to their developer-friendly nature. There are two APIs: the first implements the functionality based on node-rdkafka; the second is a promisified API with the methods akin to those in KafkaJS. By choosing this client, we can access wide functionality and have a smooth developer experience via the dev-friendly methods.

Click through for the code and explanation. Meanwhile, tabs in my heart, spaces in my job.

A/B Testing with Survival Analysis in R

Published 2024-07-12 by Kevin Feasel

Iyar Lin combines two great flavors:

Usually, when running an A/B test, analysts assign users randomly to variants over time and measure conversion rate as the ratio between the number of conversions and the number of users in each variant. Users who just entered the test and those who are in the test for 2 weeks get the same weight.

This can be enough for cases where a conversion either happens or not within a short time frame after assignment to a variant (e.g., finishing an on-boarding flow).

There are however many instances where conversions are spread over a longer time frame. One example would be first order after visiting a site landing page. Such conversions may happen within minutes, but a large churn could also happen within days after the first visit.

Read on for the scenario, as well as a simulation. I will note that, in the digital marketing industry, there’s usually a hard cap on number of days where you’re able to attribute a conversion to some action for exactly the reason Iyar mentions. H/T R-Bloggers.
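
To see why equal weighting can mislead, here is a small Python sketch comparing the naive conversion rate with a Kaplan-Meier estimate. Iyar's post works in R; this is not his code, the `kaplan_meier` helper is written from the textbook definition, and the observations are made up. Each observation is `(days_observed, converted)`, where `converted=False` means the user had not converted when we stopped watching (a censored observation).

```python
from typing import List, Tuple


def kaplan_meier(observations: List[Tuple[int, bool]]) -> List[Tuple[int, float]]:
    """Return (day, probability of still being unconverted) at each conversion day."""
    event_days = sorted({day for day, converted in observations if converted})
    survival = 1.0
    curve = []
    for day in event_days:
        # Users still under observation and unconverted just before this day.
        at_risk = sum(1 for d, _ in observations if d >= day)
        # Users who converted on this day.
        events = sum(1 for d, converted in observations if d == day and converted)
        survival *= 1 - events / at_risk
        curve.append((day, survival))
    return curve


# Hypothetical variant: three conversions, three users censored at various days.
variant = [(1, True), (2, False), (3, True), (5, False), (7, True), (7, False)]

curve = kaplan_meier(variant)
naive_rate = sum(converted for _, converted in variant) / len(variant)
km_rate = 1 - curve[-1][1]  # estimated conversion probability by the last event day

print(f"naive: {naive_rate:.4f}, Kaplan-Meier: {km_rate:.4f}")
```

With this made-up data the naive rate is 0.5 while the Kaplan-Meier estimate is 0.6875, because users who were only observed for a couple of days stop counting against the variant once they leave the at-risk set.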
