Press "Enter" to skip to content

Day: December 20, 2023

An Introduction to Poisson Regression

Steven Sanderson talks about a discrete form of regression:

Hey data enthusiasts! Today, we’re diving into the fascinating world of count data and its trusty sidekick, Poisson regression. Buckle up, because we’re about to explore how this statistical powerhouse helps us understand the factors influencing, you guessed it, counts.

Scenario: Imagine you’re an education researcher, eager to understand how a student’s GPA might influence their job offer count after graduation. But hold on, job offers aren’t continuous – they’re discrete, ranging from 0 to a handful. That’s where Poisson regression comes in!

I have an unhealthy love for Poisson techniques, so I highly recommend checking this out.

Comments closed

Advent of Code Day 5

Kevin Wilkie continues the advent of code. Part 1 starts with data prep and ends with some of the analysis:

There was a lot going on with this one, so let’s go ahead and get started. First thing we want to do, as always, is to go ahead and load all of our data into SQL Server. With this one, I cheated a little and loaded it into multiple tables to start with – since there are 7 different groups of data that we have to play with.

Could I have loaded it into one table and pull from there into other tables? Absolutely! Do I really want to? Heck, no!

Part 2 continues cross-referencing over a larger set of data:

This time, the meaning of the seed numbers is slightly changing on us. Instead of it only being 20 seeds we have to cross-reference (at least that’s how many seeds my input lists out), we now have 10 seeds and a rather large range of numbers to work with after each of those 10.

Thankfully, we can still use our Tally table to create yet another table with all of the seeds listed in it – yes, all of them!

Read on for Kevin’s solution.

Comments closed

Database Compatibility Levels and SQL Server Versions

Kendra Little demands change:

According to Microsoft’s documentation, “Database compatibility level … allow[s] the SQL Server Database Engine to be upgraded while keeping the same functional status for connecting applications by maintaining the same pre-upgrade database compatibility level.”

But these days, the “functional status” of a database at a given compatibility level differs depending on whether you’re using SQL Server, Azure SQL Managed Instance, or Azure SQL Database – and in the hosted versions it may change anytime without notice. Surprise, your database behaves differently now!

The whole concept is breaking down, and this is bad news for users of both managed services and the boxed product.

Read on to learn more about the issue.

Comments closed

Concatenating Many Files in Azure Blob Storage

Drew Furgiuele concatenates a lot of files:

Lately, I’ve found myself with a few requests from friends and users that have a particular problem: they’ve got themselves a data lake in Azure, and they can read and write files just fine to it. The problem, though, is that sometimes they need to take a series of files and mash them all together, or as the cool kids call it: concatenate them. And when it comes to third party tools and methods that can do the trick, you’re spoiled for choice: Azure Data Factory, Spark via Databricks, or even PowerShell.

Case in point: I was working with someone who had tens of thousands of CSV files that they needed to merge together into one big file, but they were already out in their Azure storage account. That doesn’t sound so bad, does it? 

Drew explains why it is, but also why it isn’t. So click through and check that out.

Comments closed

Tokenizing SQL Queries

Phil Factor has a new script:

Imagine this: you have several directories full of SQL script files, and you need to know where a certain table is used. You’d rather like the context too, so you can check the whole SQL Expression and work out why it is running so slowly. Maybe, from that same daunting set of several directories, you need to search for a comment, either end of line, or block comment, perhaps structured. It could be that you just need to execute each query or statement in turn to check performance.

It’s not unusual to want to search just within strings. Although simple searches can be effective, you will at some point need a tool that is able to recognise and return a collection of strings representing the SQL code, divided up into the respective components of the SQL language.

For this article, we’ll use a PowerShell cmdlet, called Tokenize-SQLString, which is in my GitHub repository.

Read on for examples of how it works and the types of things you can do with it.

Comments closed

Window Function Execution Plans with RANGE

Hugo Kornelis continues a series on explaining the execution plans for window functions:

This is part twenty-six of the plansplaining series. And already the fourth episode about window functions. The first of those posts covered basic window functions; the second post focused on fast-track optimization for running aggregates, and the third post explained how the optimizer works around the lack of execution plan support for UNBOUNDED FOLLOWING.

But all of those were about OVER specifications that use the ROWS keyword. Let’s now look at the alternative, the RANGE keyword.

Click through to see how the various options work with RANGE. By the way, I still want range intervals, like how Postgres implements them, where you can define an interval of X days/hours/minutes/whatever rather than a specific number of rows. Maybe one of these versions…

Comments closed

Looping through Lakehouses in Microsoft Fabric Spark Jobs

Dennes Torres builds a loop:

I have published videos and articles before about Lakehouse maintenance. In this article I want to address a missing point for a lot of Fabric administrators: How to do maintenance on multiple lakehouses that are located in different workspaces.

One of the videos I have published explains the maintenance of multiple lakehouses, but only addresses maintenance in a single workspace. Is it a good idea to keep multiple lakehouses in the same workspace? Probably not.

Click through for the process.

Comments closed