Category: Data Science

Scoring the Quality of Binary Classification with SQL Server

Sebastiao Pereira quantifies a result:

Machine Learning (ML) is a way of teaching computers to learn from data instead of being explicitly programmed. Performance metrics are essential tools for understanding how well a model actually works. They tell you not just how accurate the model is, but how reliable, fair, and useful it will be in real-world applications. In other words, without them, machine learning would be trial-and-error guesswork.

Binary classification is when each sample is labeled as one of two mutually exclusive classes, such as positive or negative.

How do you implement binary classification performance metrics in SQL Server without using external tools?

Click through for a series of metrics to determine how well a binary classification process performed. This post doesn’t include details on how to perform the classification, just what to do once you have the results.
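
As a rough illustration of what scoring a finished classification looks like, here is a minimal Python sketch of four common metrics derived from the confusion matrix. The choice of metrics and the function name binary_metrics are assumptions for illustration; the linked post works in T-SQL and may cover a different or longer list.

```python
def binary_metrics(actual, predicted):
    """Score 0/1 predictions against 0/1 ground-truth labels.

    Hypothetical helper for illustration; 1 is treated as the positive class.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 3 true negatives, 1 of each error type.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

The same counts translate naturally to SQL as conditional aggregates over a table of actual and predicted labels.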

When R^2 Misleads

Holger von Jouanne-Diedrich explains a common quality metric for regression analyses:

A high R^2 can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high R^2 is not always a sign of a good model, read on!

Click through for that explanation. This post does a fantastic job of explaining the technical reasons why a high R^2 might not be indicative of a good model specification. But I’d add one other piece to the puzzle: what constitutes a high R^2 will depend very much on the domain. For example, if you are performing a regression of some process in physics, an R^2 of 0.90 is probably so low as to indicate you’ve made a horrible mistake somewhere.

By contrast, an R^2 of 0.90 in the context of a social studies analysis would get you laughed out of the room for obviously faking the data or misunderstanding the specification to get a number that high.
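
One way a high R^2 can coexist with a wrong specification, sketched in Python: fit a straight line to data that is exactly quadratic. This example is my own illustration and is not taken from the linked post; the function name linear_fit_r2 and the data are made up.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns slope, intercept, R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# The true relationship is y = x^2, with no noise at all.
xs = list(range(11))
ys = [x * x for x in xs]
slope, intercept, r2 = linear_fit_r2(xs, ys)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.4f}")  # R^2 is roughly 0.93

# The residuals are positive at both ends and negative in the middle,
# a systematic pattern that the single R^2 number hides.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

The line explains about 93% of the variance even though the model form is simply wrong, which is why a residual plot is worth checking alongside the summary statistic.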

Training, Serving, and Deploying Scikit-Learn Models via FastAPI

Abid Ali Awan serves a model:

In this article, you will learn how to train a Scikit-learn classification model, serve it with FastAPI, and deploy it to FastAPI Cloud.

Topics we will cover include:

  • How to structure a simple project and train a Scikit-learn model for inference.
  • How to build and test a FastAPI inference API locally.
  • How to deploy the API to FastAPI Cloud and prepare it for more production-ready usage.

Click through for the process.

Cross-Workspace MLflow Logging Available in Microsoft Fabric

Ruixin Xu announces a feature now generally available:

Cross-workspace logging works through the synapseml-mlflow package, which provides a Fabric-compatible MLflow tracking plugin. The core idea is simple: set the MLFLOW_TRACKING_URI to point at your target workspace and use standard MLflow commands. Your experiments, metrics, parameters, and registered models land in the workspace you choose, not just the one you’re running in.

Read on for the full announcement.

Implementing SOFTMAX in SQL Server

Sebastiao Pereira is back with another formula:

The SOFTMAX function takes raw scores and converts them into a probability distribution. This mathematical function is used in neural network training, multiclass classification methods, multinomial logistic regression, multiclass linear discriminant analysis, and naïve Bayes classifiers. How can this function be built in SQL Server?

Click through for the implementation.
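
For reference, the function itself is short. Below is a minimal Python sketch of softmax, exp(x_i) divided by the sum of exp(x_j) over all scores. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that I have added; whether the linked T-SQL version does the same is not something I have checked.

```python
import math


def softmax(scores):
    """Convert a non-empty list of raw scores into probabilities that sum to 1."""
    # Shifting by the max leaves the result unchanged mathematically
    # but keeps exp() from overflowing on large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)

# Large scores that would overflow a naive exp() still work.
big = softmax([1000.0, 1000.0])
```

Higher scores always map to higher probabilities, and equal scores split the probability mass evenly.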

Updating a Mean without Original Data Points

John Cook has an interesting solution:

This post will look at the problem of updating an average grade as a very simple special case of Bayesian statistics and of Kalman filtering.

Suppose you’re keeping up with your average grade in a class, and you know your average after n tests, all weighted equally.
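
For the equally weighted case described above, the update needs only the current average, the count n, and the new score. Here is a minimal Python sketch; the function name update_mean is hypothetical, and the linked post frames the result in Bayesian and Kalman-filter terms rather than as code.

```python
def update_mean(mean, n, new_value):
    """Return the average of n + 1 values, given the average of the first n.

    Equivalent to (n * mean + new_value) / (n + 1), written as a correction
    to the old mean proportional to how far the new value is from it.
    """
    return mean + (new_value - mean) / (n + 1)


# An average of 80 after 4 tests, then a 90 on the fifth test.
updated = update_mean(80.0, 4, 90.0)
print(updated)  # 82.0
```

The correction term shrinks as n grows, so each new test moves the average less than the one before it.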

Click through for the walkthrough. This is similar to something I tried to puzzle out but ultimately admitted defeat: is there a way to calculate updates to the median without needing to know the entire set? In practical terms, this would be something like, how many pieces of information do I need to guarantee that I can maintain a median over time?

The best I could come up with was built along the premise of the likelihood of new data points being less than the median versus those greater than the median, where each pair of greater-lesser cancel each other out. If you have roughly equal numbers of new data points to each side, your “elements of the median” array can be pretty small. But the problem is, for any sufficiently small k, where k represents the number of elements you keep in memory, it is possible for a localized collection of (without loss of generality) lower-than-median data points to come in and completely wash out your memory. For example, if you kept 3 points in memory and you have four values below the median, you no longer know what the median is.

Trying to solve this without knowing the shape of the distribution or making any sequencing assumptions is something that I failed to do.

Implementing Shamir’s Secret Sharing in SQL Server

Sebastiao Pereira implements an algorithm:

Shamir’s Secret Sharing is a cryptographic algorithm that allows a secret to be split into multiple components and shared among a group in such a way that the secret can only be revealed if a minimum number of components are combined. Is it possible to have this algorithm implemented in SQL Server without using external tools?

Click through for a T-SQL implementation, as well as one using CLR.
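
To make the threshold idea concrete, here is a minimal Python sketch of the scheme: the secret is the constant term of a random polynomial over a prime field, each share is a point on that polynomial, and Lagrange interpolation at x = 0 recovers the secret. The function names, the choice of the Mersenne prime 2^127 - 1, and the integer-only secret are assumptions for illustration and are unrelated to the linked T-SQL and CLR implementations. It is a teaching sketch, not production cryptography.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret, threshold, num_shares, prime=PRIME):
    """Split an integer secret into num_shares points; any threshold of them recover it."""
    if not 0 <= secret < prime:
        raise ValueError("secret must be in the range [0, prime)")
    if not 1 <= threshold <= num_shares:
        raise ValueError("need 1 <= threshold <= num_shares")
    # Polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(prime) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, modulo the prime
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Lagrange-interpolate the polynomial at x = 0 from (x, y) shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


shares = split_secret(123456789, threshold=3, num_shares=5)
print(recover_secret(shares[:3]))   # any 3 of the 5 shares suffice
print(recover_secret(shares[2:5]))
```

With fewer than the threshold number of shares, interpolation yields a value that is unrelated to the secret, which is the property the scheme relies on.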

A Review of the Portmanteau Theorem

Ben Smith digs into a theorem:

The Portmanteau Theorem provides a set of equivalences of weak convergence that still remains relevant for establishing asymptotic results in probability and statistics. While the theory around weak convergence is well developed, I was inspired to put together a writeup proving all the equivalences in a self-contained manner, by first presenting the relevant theorems applied (without proving them) along with a visual on the implication cycle created for the proof and some discussion about other presentations available in popular textbooks and some historical notes.

Click through for the PDF.

Predictive Analytics with Power BI and Microsoft Fabric

Ruixin Xu puts together a how-to guide:

Across industries, teams use Power BI to understand what has already happened. Dashboards show trends, highlight performance, and keep organizations aligned around a shared view of the business.

But leaders are asking new questions—not just what happened, but what is likely next and how outcomes might change if they act. They want insights that help teams prioritize, intervene earlier, and focus effort where it matters. This is why many organizations look to enrich Power BI reports with machine learning.

This challenge is especially common in financial services.

Consider a bank that uses Power BI to track customer activity, balances, and service usage. Historical analysis shows that around 20% of customers churn, with churn tied to factors such as customer tenure, product usage, service interactions, and balance changes.

Click through for the architecture example and process. The actual model is a LightGBM model, which is generally fine for two-class classification.

Choosing between PCA and t-SNE

Shittu Olumide visualizes some data:

For data scientists, working with high-dimensional data is part of daily life. From customer features in analytics to pixel values in images and word vectors in NLP, datasets often contain hundreds or thousands of variables. Visualizing such complex data is difficult.

That’s where dimensionality reduction techniques come in. Two of the most widely used methods are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). While both reduce dimensions, they serve very different goals.

The thing that ultimately soured me on t-SNE is the stochastic nature. You can run the same set of operations multiple times and get significantly different results. It’s really easy to use and the output graphs are really pretty, but if you can’t trust the outputs to be at least somewhat stable, there’s a hard limit to its value.
