Press "Enter" to skip to content

Month: November 2024

An Overview of the Naive Bayes Class of Algorithms

Harris Amjad takes us through a rather useful class of algorithms for classification:

As AI and Machine Learning have increased in popularity, especially Large Language Models, more professionals have explored how these systems work. Unfortunately, some put the cart before the horse, where they take on more complex algorithms before learning to pave the foundation, resulting in faded interest in the topic. This tip will introduce a simple probabilistic, yet powerful classifier, the Naïve Bayes Model, and implement it in Python.

I like using the Naive Bayes variants, despite the fact that it is not Bayesian and arguably isn’t very naive. The reason I like to use this class of algorithm is that it’s fast, easy, and gives you a useful baseline for quality. If you need to meet some specific quality threshold (say, accuracy > 85% or F1-score above 0.8), you can get an answer quickly with Naive Bayes. If that answer is anywhere near your threshold, the problem is likely solvable. If your answer is way below the threshold, it’s probably not worth spending the time or compute effort trying out a variety of other algorithms.

Comments closed

Analyzing Semantic Model Logs via Microsoft Fabric

Sandeep Pawar parses the logs:

Workspace Monitoring was one of my favorite announcements at MS Ignite ‘24 this week. It logs events from Fabric items such as Semantic Models, Eventhouse, GraphQL to a KQL database that’s automatically provisioned and managed in that workspace. Currently it’s limited to these three items but hopefully other (especially spark and pipelines) will be added soon. Read the announcement by Varun Jain (PM, Microsoft) on this for details. 

Click through for some thoughts from Sandeep, as well as a variety of useful queries.

Comments closed

Generating Effect Plots in Python and R

MIchael Mayer builds some effect plots:

The plots show different types of feature effects relevant in modeling:

  • Average observed: Descriptive effect (also interesting without model).
  • Average predicted: Combined effect of all features. Also called “M Plot” (Apley 2020).
  • Partial dependence: Effect of one feature, keeping other feature values constant (Friedman 2001).
  • Number of observations or sum of case weights: Feature value distribution.
  • R only: Accumulated local effects, an alternative to partial dependence (Apley 2020).

Click through to see how they both work.

Comments closed

New Syntax in SQL Server 2022

Steve Jones talks syntax:

At the recent 2024 PASS Data Community Summit, I saw a lightning talk from Mikey Bronowski on the New T-SQL Functions in SQL Server 2022.  Before the talk, I made a joke with him that none of these were new because it’s 2024. They’d been out for 2 years.

Mikey did a nice job, given that he only had 10 minutes, but it was enough to give attendees an idea of some new things they might investigate to use in their own code. If you missed the talk, or you don’t have access to the recordings, we have a series on SQL Server Central that covers these (Part 1Part 2Part 3) and helps you understand the new options. The list of language changes is also in the MS Docs under What’s New in the Language.

Steve asks about how much utilization there is of these. I think the problem is that so few organizations have adopted SQL Server 2022 that knowledge that these things even exist is quite low.

By the way, I do have a script covering many of these new bits of syntax as part of a talk on getting up to speed with core T-SQL changes since SQL Server 2012.

Comments closed

The Difficulty of Deletion in PostgreSQL

Radim Marek takes us through the process of deleting rows:

Your database is ticking along nicely – until a simple DELETE brings it to its knees. What went wrong? While we tend to focus on optimizing SELECT and INSERT operations, we often overlook the hidden complexities of DELETE. Yet, removing unnecessary data is just as critical. Outdated or irrelevant data can bloat your database, degrade performance, and make maintenance a nightmare. Worse, retaining some types of data without valid justification might even lead to compliance issues.

Read on to learn about the process, some challenges, and a common pattern for resolving these challenges. The solution is pretty similar in SQL Server as well: batching delete operations, ideally with a supporting index.

Comments closed

AWS DMS and a LOB Bug

Richard O’Riordan fixes an issue:

The table over in our Postgres cluster is similar except for the data type “text” being used instead of “varchar”. All kind of boring so far, but what we noticed that on some very rare occasions the “largevalue” column was empty over in the PostgreSQL database even though for that row it was populated in SQL Server.

This seemed odd to me, like you would expect if there was some error inserting the row on the PostgreSQL side then since it is all done within a transaction that it would either all succeed or all fail, how would the row be partially inserted, i.e. missing this text value.

Read on for the story and several tests of how things work.

Comments closed

Query Hash Value Changes

Kendra Little digs into a problem:

This is the worst bug I’ve found in SQL Server to date. Previously, my top find was SQL Server Online Index Rebuild sometimes happens offline without warning. This one has taken top slot because it makes my life more difficult on a daily basis.

Background: SQL Server generates a query_hash for each query. This is stored in sys.query_store_query and it’s one of the primary ways you can identify what a query is across different Query Stores, or even the same Query Store over time, as surrogate query_id values get reset if Query Store is cleared or data ages on. The query_hash is a “Zobrist hash over the shape of the individual query, based on the bound (input) logical query tree. Query hints aren’t included as part of the hash.” (Source)

Except that’s wrong.

Read the whole thing.

Comments closed

Microsoft Fabric Shapes for Excalidraw

Miles Cole has an announcement:

I’m thrilled to announce that my Microsoft Fabric Shape Library for Excalidraw has been published!

Documenting and diagramming solution architectures is critical for both conceptualizing and building maintainable data solutions. Whether you are whiteboarding ideas for review with your customer or technical lead, or documenting an existing solution so that new or future team members can understand your spaghetti architecture, high-quality diagrams are a must.

Having spent time in both web and graphic design at various points in my career, I’ve developed a stumbling block: I have a hard time feeling good about a diagram unless it looks stellar. While I’ve successfully used a few diagramming tools in the past, I’ve never fallen in love with one—until now. Enter Excalidraw.

Click through to learn more about Excalidraw and try out the icons. I’ve always been a fan of draw.io (which was also diagrams.net for a little while but appears to be back to its original name) and mermaid, but this also looks interesting.

Comments closed

Converting an Excel Workbook to CSV via Microsoft Fabric

Jared Westover builds a Data Factory job:

After a two-year break, I started working with Azure Data Factory again, now part of the Fabric family. I quickly adapted to Data Factory since it closely resembled SQL Server Integration Services (SSIS), a tool with which I had a love-hate relationship. For my new mission, I set out to convert a list of files from Excel to comma-separated values (CSV). We upload the original Excel files to a Data Lake in Fabric. We then need to convert a specific worksheet and move the CSV files to a different folder in Data Lake.

Read on to see what Jared came up with.

Comments closed