Press "Enter" to skip to content

Category: Data Science

A Review of the Portmanteau Theorem

Ben Smith digs into a theorem:

The Portmanteau Theorem provides a set of equivalent conditions for weak convergence that remains relevant for establishing asymptotic results in probability and statistics. While the theory around weak convergence is well developed, I was inspired to put together a writeup proving all the equivalences in a self-contained manner: first presenting the relevant theorems applied (without proving them), along with a visual of the implication cycle created for the proof, some discussion of other presentations available in popular textbooks, and some historical notes.

Click through for the PDF.
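
As a quick refresher before you dive in, the equivalences typically included in the theorem (e.g., in Billingsley's presentation) run along these lines, with X_n => X denoting weak convergence:

```latex
% Portmanteau Theorem: the following are equivalent to X_n \Rightarrow X.
\begin{align*}
&\text{(i)}   && \mathbb{E}\,f(X_n) \to \mathbb{E}\,f(X) && \text{for all bounded continuous } f, \\
&\text{(ii)}  && \limsup_{n} \mathbb{P}(X_n \in F) \le \mathbb{P}(X \in F) && \text{for all closed sets } F, \\
&\text{(iii)} && \liminf_{n} \mathbb{P}(X_n \in G) \ge \mathbb{P}(X \in G) && \text{for all open sets } G, \\
&\text{(iv)}  && \mathbb{P}(X_n \in A) \to \mathbb{P}(X \in A) && \text{for all } A \text{ with } \mathbb{P}(X \in \partial A) = 0.
\end{align*}
```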

Predictive Analytics with Power BI and Microsoft Fabric

Ruixin Xu puts together a how-to guide:

Across industries, teams use Power BI to understand what has already happened. Dashboards show trends, highlight performance, and keep organizations aligned around a shared view of the business.

But leaders are asking new questions—not just what happened, but what is likely next and how outcomes might change if they act. They want insights that help teams prioritize, intervene earlier, and focus effort where it matters. This is why many organizations look to enrich Power BI reports with machine learning.

This challenge is especially common in financial services.

Consider a bank that uses Power BI to track customer activity, balances, and service usage. Historical analysis shows that around 20% of customers churn, with churn tied to factors such as customer tenure, product usage, service interactions, and balance changes.

Click through for the architecture example and process. The actual model is a LightGBM model, which is generally fine for two-class classification.
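
If you want to kick the tires on the modeling piece outside of Fabric, here is a minimal sketch of a two-class LightGBM churn classifier. The feature names and synthetic data are hypothetical stand-ins for the scenario the article describes, not its actual pipeline:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical churn features loosely mirroring the article's scenario
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 120, n),
    "num_products": rng.integers(1, 5, n),
    "service_calls": rng.poisson(2, n),
    "balance_change": rng.normal(0, 1, n),
})

# Synthetic label with roughly a 20% churn rate tied to the features
logit = -1.8 - 0.01 * df["tenure_months"] + 0.4 * df["service_calls"] - 0.5 * df["balance_change"]
churned = rng.random(n) < 1 / (1 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(df, churned, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier(objective="binary", n_estimators=200)
clf.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```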

Choosing between PCA and t-SNE

Shittu Olumide visualizes some data:

For data scientists, working with high-dimensional data is part of daily life. From customer features in analytics to pixel values in images and word vectors in NLP, datasets often contain hundreds or even thousands of variables. Visualizing such complex data is difficult.

That’s where dimensionality reduction techniques come in. Two of the most widely used methods are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). While both reduce dimensions, they serve very different goals.

The thing that ultimately soured me on t-SNE is the stochastic nature. You can run the same set of operations multiple times and get significantly different results. It’s really easy to use and the output graphs are really pretty, but if you can’t trust the outputs to be at least somewhat stable, there’s a hard limit to its value.
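
To make that concrete, here is a minimal scikit-learn sketch: PCA (with an exact solver) reproduces exactly across runs, while t-SNE embeddings move around with the seed. Pinning random_state makes a single run repeatable, but it doesn't tell you how much the layout depends on that arbitrary choice:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# PCA with the full SVD solver is deterministic: two runs match exactly
pca_a = PCA(n_components=2, svd_solver="full").fit_transform(X)
pca_b = PCA(n_components=2, svd_solver="full").fit_transform(X)
print("PCA runs identical:", np.allclose(pca_a, pca_b))

# t-SNE is stochastic: different seeds give noticeably different layouts
tsne_a = TSNE(n_components=2, random_state=0).fit_transform(X)
tsne_b = TSNE(n_components=2, random_state=1).fit_transform(X)
print("t-SNE max coordinate difference:", np.abs(tsne_a - tsne_b).max())
```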

Implementing the OPTICS Clustering Algorithm in SQL Server

Sebastiao Pereira implements an algorithm:

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters, very similar to DBSCAN. However, OPTICS handles clusters of varying densities more effectively, offering deeper insight by exposing the hierarchical structure of your data. The algorithm is generally more computationally intensive.

Is it possible to have the OPTICS clustering algorithm implemented in SQL Server without using an external solution?

Click through for that implementation.
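
As a point of reference for what the algorithm does (the article's implementation is pure T-SQL; this sketch uses scikit-learn instead), here's OPTICS on two blobs of very different density, the case where a single DBSCAN eps value struggles:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with very different densities: the scenario where DBSCAN's
# single eps value struggles and OPTICS's reachability ordering shines
X_dense, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=0.3, random_state=0)
X_sparse, _ = make_blobs(n_samples=300, centers=[[6, 6]], cluster_std=1.5, random_state=1)
X = np.vstack([X_dense, X_sparse])

opt = OPTICS(min_samples=10).fit(X)
print("Cluster labels found:", sorted(set(opt.labels_)))

# The reachability plot (reachability_ taken in ordering_ order) exposes
# the hierarchical structure: valleys correspond to clusters
reach = opt.reachability_[opt.ordering_]
print("Sample reachability values:", np.round(reach[1:6], 3))
```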

Operating on Distributions in R with distionary

Vincenzo Coia announces a new R package:

After passing through rOpenSci peer review, the distionary package is now newly available on CRAN. It allows you to make probability distributions quickly – either from a few inputs or from its built-in library – and then probe them in detail.

These distributions form the building blocks for piecing together advanced statistical models within the wider probaverse ecosystem, which is built to release modelers from low-level coding so production pipelines stay human-friendly. Right now, the other probaverse packages are distplyr, allowing you to morph distributions into new forms, and famish, allowing you to tune distributions to data. Developed with risk analysis use cases like climate and insurance in mind, the same tools translate smoothly to simulations, teaching, and other applied settings.

Click through for an overview of the package.

A Primer on Data Analysis with Python and SQL Server

Eduardo Pivaral shows off a few examples of analysis techniques:

With the rise of the cloud, automation, and managed services, the role of the Database Administrator has pivoted toward data engineering. The focus is on maintaining, securing, and cleansing data to support analysis and decision-making by the business.

How can we start using modern data analysis tools with our current SQL Server infrastructure? Further, how can we start providing end users and decision makers with important insights about our data, without spending extra money on enterprise data analysis tools?

Click through for demonstrations of k-means clustering for discerning categorical groups of data, simple demand forecasting, and generating customer segments.
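
As a flavor of the clustering piece, here is a minimal sketch of pulling data out of SQL Server with pandas and segmenting it with k-means. The connection string, table, and column names are hypothetical placeholders, not the article's schema:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sqlalchemy import create_engine

# Hypothetical connection string and table: substitute your own
engine = create_engine(
    "mssql+pyodbc://user:pass@MyServer/SalesDB?driver=ODBC+Driver+17+for+SQL+Server"
)
df = pd.read_sql(
    "SELECT CustomerID, TotalSpend, OrderCount FROM dbo.CustomerSummary", engine
)

# Standardize so one large-scale column doesn't dominate the distances
features = StandardScaler().fit_transform(df[["TotalSpend", "OrderCount"]])

# Three customer segments; inspect per-segment averages to label them
df["Segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(df.groupby("Segment")[["TotalSpend", "OrderCount"]].mean())
```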

From Conjecture to Hypothesis and the Failure of Data-Driven

Alexander Arvidsson does some research:

I’ve spent the last few weeks diving deep into something that’s been bothering me for years. Everyone talks about being “data-driven,” but when you actually look at what that means in practice, something doesn’t add up. Companies are knee-deep in data, wading in dashboards, drowning in reports, and yet… nothing changes.

So I went looking for examples. Real examples. Not “we implemented analytics and it was amazing” marketing fluff, but concrete cases where data actually improved outcomes. What I found was fascinating, and not at all what the analytics vendors want you to hear.

This is an interesting article and starts to get at the reason why “data-driven” companies fail to deliver on their promise. It also gets to one of my nag points around dashboards: the purpose of a dashboard is to provide relevant parties with enough information, at a glance, to take whatever action is necessary. In order to develop a good dashboard, you need to understand all of that information: who the relevant parties are, what decision points exist, under what circumstances an individual should take action, and (ideally) what action the individual could take. But that’s a lot of information and a lot of effort to tease out the right answers.

Python Libraries for Advanced Time Series Forecasting

Ivan Palomares Carrascosa has a list:

Fortunately, Python’s ecosystem has evolved to meet this demand. The landscape has shifted from purely statistical packages to a rich array of libraries that integrate deep learning, machine learning pipelines, and classical econometrics. But with so many options, choosing the right framework can be overwhelming.

This article cuts through the noise to focus on 5 powerhouse Python libraries designed specifically for advanced time series forecasting. We move beyond the basics to explore tools capable of handling high-dimensional data, complex seasonality, and exogenous variables. For each library, we provide a high-level overview of its standout features and a concise “Hello World” code snippet to get you started immediately.

Click through for an explanation of each of the five libraries.
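
The five picks are behind the click, but as a taste of what a “Hello World” for seasonality plus exogenous variables looks like, here's a sketch using statsmodels' SARIMAX (not necessarily one of the article's five) on synthetic monthly data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with yearly seasonality and one exogenous driver
rng = np.random.default_rng(42)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
exog = pd.Series(rng.normal(size=96), index=idx, name="promo")
y = 10 + 2 * np.sin(2 * np.pi * idx.month / 12) + 0.5 * exog + rng.normal(scale=0.3, size=96)

model = SARIMAX(y, exog=exog, order=(1, 0, 0), seasonal_order=(1, 0, 0, 12))
fit = model.fit(disp=False)

# Forecasting requires exogenous values for the horizon as well
future_idx = pd.date_range(idx[-1] + pd.offsets.MonthBegin(), periods=12, freq="MS")
future_exog = pd.Series(rng.normal(size=12), index=future_idx, name="promo")
print(fit.forecast(steps=12, exog=future_exog))
```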

How Data Leakage Can Hurt Model Performance

Ivan Palomares Carrascosa leaks some data:

In this article, you will learn what data leakage is, how it silently inflates model performance, and practical patterns for preventing it across common workflows.

Topics we will cover include:

  • Identifying target leakage and removing target-derived features.
  • Preventing train–test contamination by ordering preprocessing correctly.
  • Avoiding temporal leakage in time series with proper feature design and splits.

Read on to learn more.
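
The train–test contamination bullet is the one I see bite most often. Here's a minimal sklearn sketch of the difference between fitting a scaler on the full dataset before cross-validation (leaky) and fitting it inside a pipeline on each training fold (clean):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: the scaler sees every row, including future test folds,
# before cross-validation ever splits the data
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Clean: the pipeline refits the scaler on each training fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("Leaky CV accuracy:", leaky_scores.mean())
print("Clean CV accuracy:", clean_scores.mean())
```

With a plain scaler the gap is usually small; swap in target encoding or supervised feature selection and the same ordering mistake can inflate scores dramatically.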
