Category: Data Science

A Primer on Data Analysis with Python and SQL Server

Eduardo Pivaral shows off a few examples of analysis techniques:

With the rise of cloud, automation and managed services, the role of the Database Administrator has pivoted towards Data Engineering. The focus is to maintain, secure, and cleanse data in order for data analysis and decision making by the business.

How can we start using modern data analysis tools with our current SQL Server infrastructure? Further, how can we start providing end users and decision makers with important insights about our data, without spending extra money on enterprise data analysis tools?

Click through for demonstrations of k-means clustering for discerning categorical groups of data, simple demand forecasting, and generating customer segments.

From Conjecture to Hypothesis and the Failure of Data-Driven

Alexander Arvidsson does some research:

I’ve spent the last few weeks diving deep into something that’s been bothering me for years. Everyone talks about being “data-driven,” but when you actually look at what that means in practice, something doesn’t add up. Companies are knee-deep in data, wading in dashboards, drowning in reports, and yet… nothing changes.

So I went looking for examples. Real examples. Not “we implemented analytics and it was amazing” marketing fluff, but concrete cases where data actually improved outcomes. What I found was fascinating, and not at all what the analytics vendors want you to hear.

This is an interesting article, and it starts to get at the reason why “data-driven” companies fail to deliver on their promise. It also touches on one of my nag points around dashboards: the purpose of a dashboard is to give the relevant parties enough information, at a glance, to take whatever action is necessary. To develop a good dashboard, you need to understand all of that information: who the relevant parties are, what decision points exist, under what circumstances an individual should take action, and (ideally) what action the individual could take. But that’s a lot of information, and it takes a lot of effort to tease out the right answers.

Python Libraries for Advanced Time Series Forecasting

Ivan Palomares Carrascosa has a list:

Fortunately, Python’s ecosystem has evolved to meet this demand. The landscape has shifted from purely statistical packages to a rich array of libraries that integrate deep learning, machine learning pipelines, and classical econometrics. But with so many options, choosing the right framework can be overwhelming.

This article cuts through the noise to focus on 5 powerhouse Python libraries designed specifically for advanced time series forecasting. We move beyond the basics to explore tools capable of handling high-dimensional data, complex seasonality, and exogenous variables. For each library, we provide a high-level overview of its standout features and a concise “Hello World” code snippet to familiarize yourself immediately.

Click through for an explanation of each of the five libraries.

How Data Leakage Can Hurt Model Performance

Ivan Palomares Carrascosa leaks some data:

In this article, you will learn what data leakage is, how it silently inflates model performance, and practical patterns for preventing it across common workflows.

Topics we will cover include:

  • Identifying target leakage and removing target-derived features.
  • Preventing train–test contamination by ordering preprocessing correctly.
  • Avoiding temporal leakage in time series with proper feature design and splits.

Read on to learn more.
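
As a rough illustration of the second and third bullets, here is a minimal standard-library sketch. It is not taken from the linked article; the data and the helper names `fit_scaler` and `transform` are made up for the example. The idea is to split first (chronologically, so later rows never inform earlier ones) and to compute scaling statistics from the training rows only.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn scaling statistics from the training rows only."""
    return mean(train_values), pstdev(train_values)

def transform(values, scaler):
    """Standardize values using previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical series in time order; the last row is the held-out test set.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]   # chronological split, no shuffling

# Leak-free ordering: split first, then fit on train, then apply to both.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)

# Leaky ordering, for contrast: fitting on all rows lets the test value
# shift the mean and standard deviation the model sees during training.
leaky_scaler = fit_scaler(values)

print(scaler[0], leaky_scaler[0])  # training-only mean vs. contaminated mean
```

The gap between the two means shows how much a single held-out row can change what the model learns from when preprocessing runs before the split.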

Using Haskell for Data Science

Jonathan Carroll has my attention:

I’ve been learning Haskell for a few years now and I am really liking a lot of the features, not least the strong typing and functional approach. I thought it was lacking some of the things I missed from R until I found the dataHaskell (www.datahaskell.org) project.

There have been several attempts recently to enhance R with some strong types, e.g. vapour (vapour.run), typr (github.com), using {rlang}’s checks (josiahparry.com), and even discussions about implementations at the core level e.g. in September 2025 (stat.ethz.ch) continued in November 2025 (stat.ethz.ch). While these try to bend R towards types, perhaps an all-in solution makes more sense.

In this post I’ll demonstrate some of the features and explain why I think it makes for a good (great?) data science language.

I’ve been a big fan of F# for data science work as well for similar reasons, so it was interesting to read this article on Haskell. H/T R-Bloggers.

When Decision Trees Fail

Ivan Palomares Carrascosa builds an explanation:

In this article, you will learn why decision trees sometimes fail in practice and how to correct the most common issues with simple, effective techniques.

Topics we will cover include:

  • How to spot and reduce overfitting in decision trees.
  • How to recognize and fix underfitting by tuning model capacity.
  • How noisy or redundant features mislead trees and how feature selection helps.

Read on for some of the perils of CART and some ways to resolve them.

Four Measures for Vector Search Quality

Joe Sack explains four important measures:

You type “3-bedroom townhouse near a good school” into a home search site. It shows 10 homes. Some perfect, some okay, some wrong. How do you know if it’s working?

Four numbers help with this: Precision (what proportion is relevant), Recall (what you missed), MRR (how far to the first relevant result), nDCG (best stuff first).

Read on to learn what each one means and how it applies to vector search.
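
To make the four numbers concrete, here is a small sketch using their commonly used definitions. It is not Joe's code: the function names, the listing IDs, and the graded relevance scores are all assumptions made up for illustration, and nDCG here uses the plain "gain divided by log2 of rank plus one" form.

```python
import math

def precision_at_k(results, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

def recall_at_k(results, relevant, k):
    """Share of all relevant items that show up in the top k."""
    return sum(1 for r in results[:k] if r in relevant) / len(relevant)

def reciprocal_rank(results, relevant):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for rank, r in enumerate(results, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (results, relevant) pairs."""
    return sum(reciprocal_rank(res, rel) for res, rel in queries) / len(queries)

def ndcg_at_k(results, gains, k):
    """DCG of the returned order divided by the DCG of the ideal order."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(r, 0) for r in results[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical search: five homes returned, three homes actually relevant.
results = ["h3", "h1", "h7", "h2", "h9"]
relevant = {"h1", "h2", "h4"}
gains = {"h1": 3, "h2": 2, "h4": 1}   # graded relevance, higher is better

print(precision_at_k(results, relevant, 5))   # 0.4
print(recall_at_k(results, relevant, 5))      # two of three relevant homes found
print(reciprocal_rank(results, relevant))     # 0.5, first hit is at rank 2
print(ndcg_at_k(results, gains, 5))           # below 1.0 because the best home is not first
```

Precision and recall ignore ordering within the top k, while reciprocal rank and nDCG reward putting the good results early, which is why the four are usually read together.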

Pulling Random Values from a Gaussian Distribution in T-SQL

Sebastiao Pereira has another way of populating a random variable:

Generating random numbers from a normal distribution is essential for accuracy and realistic modeling. Used for simulation, inference, cryptography, and algorithm design for scientific, engineering, statistical, and AI domains. Is it possible to create random Gaussian numbers in SQL Server using the Ziggurat algorithm without external tools?

I was not familiar with this technique, so it’s neat to see it in action.
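
For comparison only, here is a minimal Python sketch of a different and simpler technique, the Box-Muller transform, which turns pairs of uniform random numbers into pairs of standard normal values. This is not the Ziggurat algorithm from Sebastiao's post and not his T-SQL; the function names and the seed parameter are my own assumptions.

```python
import math
import random

def gaussian_pair(rng):
    """Box-Muller transform: two uniforms in, two standard normals out."""
    u1 = 1.0 - rng.random()          # shift [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def gaussian_samples(n, mu=0.0, sigma=1.0, seed=None):
    """Return n draws from a normal distribution with mean mu and std dev sigma."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z0, z1 = gaussian_pair(rng)
        out.extend((mu + sigma * z0, mu + sigma * z1))
    return out[:n]                   # drop the extra value when n is odd

samples = gaussian_samples(20000, seed=42)
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)       # should land close to 0 and 1
```

Box-Muller pays for a logarithm, a square root, and two trigonometric calls per pair. Ziggurat is generally described as avoiding most of that work for the bulk of draws, which is part of its appeal.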

Calculating Exponential Moving Average in T-SQL

Rick Dobson watches the flow:

Exponential moving averages (emas) are a powerful means of detecting changes in time series data. However, if you are new to this task, you may be wondering how to choose from conflicting advice about how to calculate emas. This tip reviews several of the most popular methods for calculating moving averages. Additionally, this tip presents T-SQL code samples with common table expressions and stored procedures for generating emas from an underlying time series dataset.

“Emas don’t just track trends—they reveal momentum in motion.” That’s why they’re favored when recent values matter most—and why this tip focuses on helping you calculate them with precision.

Read on for the formula and a couple of lengthy scripts to generate it.
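
For a quick feel for the recurrence before diving into the T-SQL, here is a minimal Python sketch of one common formulation. It is not Rick's code, and it makes two assumptions that are exactly the kind of thing sources disagree on: the smoothing factor is 2 / (period + 1), and the series is seeded with the first observation rather than with a simple moving average.

```python
def ema(values, period):
    """Exponential moving average.

    Recurrence: ema[t] = alpha * x[t] + (1 - alpha) * ema[t - 1]
    Assumptions: alpha = 2 / (period + 1); ema[0] is the first observation.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    alpha = 2 / (period + 1)
    out = []
    for x in values:
        if not out:
            out.append(float(x))                      # seed with the first value
        else:
            out.append(alpha * x + (1 - alpha) * out[-1])
    return out

print(ema([10, 20, 30], period=3))   # [10.0, 15.0, 22.5]
```

Each new value gets weight alpha and everything before it shares the remaining 1 - alpha, so a shorter period reacts faster to recent changes.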
