Press "Enter" to skip to content

Category: Data Science

Implementing SOFTMAX in SQL Server

Sebastiao Pereira is back with another formula:

The SOFTMAX function takes raw scores and converts into a probability distribution. This mathematical function is used in neural networking training, multiclass classification methods, multinomial logistic regression, multiclass linear discriminant analysis, and naïve Bayes classifiers. How can this function be built in SQL Server?

Click through for the implementation.

Leave a Comment

Updating a Mean without Original Data Points

John Cook has an interesting solution:

This post will look at the problem of updating an average grade as a very simple special case of Bayesian statistics and of Kalman filtering.

Suppose you’re keeping up with your average grade in a class, and you know your average after n tests, all weighted equally.

Click through for the walkthrough. This is similar to something I tried to puzzle out but ultimately admitted defeat: is there a way to calculate updates to the median without needing to know the entire set? In practical terms, this would be something like, how many pieces of information do I need to guarantee that I can maintain a median over time?

The best I could come up with was built along the premise of the likelihood of new data points being less than the median versus those greater than the median, where each pair of greater-lesser cancel each other out. If you have roughly equal numbers of new data points to each side, your “elements of the median” array can be pretty small. But the problem is, for any sufficiently small k, where k represents the number of elements you keep in memory, it is possible for a localized collection of (without loss of generality) lower-than-median data points to come in and completely wash out your memory. For example, if you kept 3 points and memory and you have four values below the median, you no longer know what the median is.

Trying to solve this without knowing the shape of the distribution or make any sequencing assumptions is something that I failed to do.

Leave a Comment

Implementing Shamir’s Secret Sharing in SQL Server

Sebastiao Pereira implements an algorithm:

Shamir’s Secret Sharing is a cryptographic algorithm that allows a secret to be split into multiple components and shared among a group in such a way that the secret can only be revealed if a minimum number of components are combined. Is it possible to have this algorithm implemented in SQL Server without using external tools?

Click through for a T-SQL implementation, as well as one using CLR.

Comments closed

A Review of the Portmanteau Theorem

Ben Smith digs into a theorem:

The Portmanteau Theorem provides a set of equivalences of weak convergence that still remains relevant for establishing asymptotic results in probability and statistics. While the theory around weak convergence is well developed, I was inspired to put together a writeup proving all the equivalences in a self contained manner, by first presenting the relevant theorems applied (without proving them) along with along with a visual on the implication cycle created for the proof and some discussion about other presentations available in popular textbooks and some historical notes.

Click through for the PDF.

Comments closed

Predictive Analytics with Power BI and Microsoft Fabric

Ruixin Xu puts together a how-to guide:

Across industries, teams use Power BI to understand what has already happened. Dashboards show trends, highlight performance, and keep organizations aligned around a shared view of the business.

But leaders are asking new questions—not just what happened, but what is likely next and how outcomes might change if they act. They want insights that help teams prioritize, intervene earlier, and focus effort where it matters. This is why many organizations look to enrich Power BI reports with machine learning.

This challenge is especially common in financial services.

Consider a bank that uses Power BI to track customer activity, balances, and service usage. Historical analysis shows that around 20% of customers churn, with churn tied to factors such as customer tenure, product usage, service interactions, and balance changes.

Click through for the architecture example and process. The actual model is a LightGBM model, which is generally fine for two-class classification.

Comments closed

Choosing between PCA and t-SNE

Shittu Olumide visualizes some data:

For data scientists, working with high-dimensional data is part of daily life. From customer features in analytics to pixel values in images and word vectors in NLP, datasets often contain hundreds and thousands of variables. Visualizing such complex data is difficult.

That’s where dimensionality reduction techniques come in. Two of the most widely used methods are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). While both reduce dimensions, they serve very different goals.

The thing that ultimately soured me on t-SNE is the stochastic nature. You can run the same set of operations multiple times and get significantly different results. It’s really easy to use and the output graphs are really pretty, but if you can’t trust the outputs to be at least somewhat stable, there’s a hard limit to its value.

Comments closed

Implementing the OPTICS Clustering Algorithm in SQL Server

Sebastiao Pereira implements an algorithm:

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters very similar do DBSCAN. However, OPTICS handles it more effectively in the case of a cluster with varying densities gaining deeper insights exploring the hierarchical structure of your data. This algorithm is generally more computationally intensive.

Is it possible to have the OPTICS clustering algorithm implemented in SQL Server without using an external solution?

Click through for that implementation.

Comments closed

Operating on Distributions in R with distionary

Vincenzo Cola announces a new R package:

After passing through rOpenSci peer review, the distionary package is now newly available on CRAN. It allows you to make probability distributions quickly – either from a few inputs or from its built-in library – and then probe them in detail.

These distributions form the building blocks that piece together advanced statistical models with the wider probaverse ecosystem, which is built to release modelers from low-level coding so production pipelines stay human-friendly. Right now, the other probaverse packages are distplyr, allowing you to morph distributions into new forms, and famish, allowing you to tune distributions to data. Developed with risk analysis use cases like climate and insurance in mind, the same tools translate smoothly to simulations, teaching, and other applied settings.

Click through for an overview of the package.

Comments closed

A Primer on Data Analysis with Python and SQL Server

Eduardo Pivaral shows off a few examples of analysis techniques:

With the rise of cloud, automation and managed services, the role of the Database Administrator has pivoted towards Data Engineering.  The focus is to maintain, secure, and cleanse data in order for data analysis and decision making by the business.

How can we start using modern data analysis tools with our current SQL Server infrastructure? Further, how can we start providing end users and decision makers with important insights about our data, without spending extra money on enterprise data analysis tools?

Click through for demonstrations of k-means clustering for discerning categorical groups of data, simple demand forecasting, and generating customer segments.

Comments closed