
Author: Kevin Feasel

Efficient Sampling of Spark Datasets

Rajesh Vakkalagadda needs a sample:

Sampling is a fundamental process in machine learning that involves selecting a subset of data from a larger dataset. This technique is used to make training and evaluation more efficient, especially when working with massive datasets where processing every data point is impractical.

However, sampling comes with its own challenges. Ensuring that samples are representative is crucial to prevent biases that could lead to poor model generalization and inaccurate evaluation results. The sample size must strike a balance between performance and resource constraints. Additionally, sampling strategies need to account for factors such as class imbalance, temporal dependencies, and other dataset-specific characteristics to maintain data integrity.

Click through for an answer in Scala. The Python implementation would be very similar.
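To make the class-imbalance point concrete, here is a minimal plain-Python sketch of stratified sampling. This is not Rajesh's Scala code and it does not use Spark's API; the function name `stratified_sample`, the `label_of` parameter, and the toy data are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=42):
    """Take roughly `fraction` of the rows from each class, keeping at least one per class.

    Hypothetical helper for illustration only; not part of Spark or the linked article.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # Sample each class separately so rare classes are not dropped by chance.
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy imbalanced dataset: 95 negatives and 5 positives (made-up data).
rows = [("neg", i) for i in range(95)] + [("pos", i) for i in range(5)]
sample = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.2)

counts = {label: sum(1 for r in sample if r[0] == label) for label in ("neg", "pos")}
print(counts)  # {'neg': 19, 'pos': 1}
```

A simple random 20% draw over all 100 rows could easily return zero positives; sampling per class guarantees the minority class is still represented. In Spark itself, `DataFrame.sampleBy` covers similar ground by taking a per-class fraction.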


Operating on Distributions in R with distionary

Vincenzo Coia announces a new R package:

After passing through rOpenSci peer review, the distionary package is now newly available on CRAN. It allows you to make probability distributions quickly – either from a few inputs or from its built-in library – and then probe them in detail.

These distributions form the building blocks that piece together advanced statistical models with the wider probaverse ecosystem, which is built to release modelers from low-level coding so production pipelines stay human-friendly. Right now, the other probaverse packages are distplyr, allowing you to morph distributions into new forms, and famish, allowing you to tune distributions to data. Developed with risk analysis use cases like climate and insurance in mind, the same tools translate smoothly to simulations, teaching, and other applied settings.

Click through for an overview of the package.


LOB Data and Replication in SQL Server

Mark Beaumont diagnoses an error:

Recently, one of our clients encountered an issue while running a data update in SQL Server. The operation failed immediately with a configuration error, specifically targeting Large Object (LOB) data:

Length of LOB data (169,494) to be replicated exceeds configured maximum 65,536. Use the stored procedure sp_configure to increase the configured maximum value for max text repl size option, which defaults to 65,536. A configured value of -1 indicates no limit, other than the limit imposed by the data type.

The tricky part was that the client wasn’t using replication. Read on to learn about the culprit.


A Primer on Cognitive Perception

Paul Turley thinks about how we think:

You can be the greatest report designer on the planet, but if your report doesn’t meet the needs of the report consumer, it’s all for nothing. In this section, I break down the most important considerations for identifying your audience and their information needs. These are all factors to consider before you jump in and start designing your report.

Paul hits on quite a few of the foundational concepts around how humans perceive visual stimuli and tells some interesting stories along the way.


Tips for Teaching Technical Topics

John Deardurff shares some advice:

After 25 years as a Microsoft Certified Trainer (MCT), one thing I have learned is that teaching technical content requires more than just subject‑matter expertise. Great technical instructors create an environment where learners feel comfortable, engaged, and motivated to explore complex concepts at their own pace. 

Click through for ten such tips. I tend to follow seven of them pretty well, though the three around questions are where I’m weakest.


Granular REST API Support for OneLake Security Role Management

Aaron Merrill announces a new preview offering:

Microsoft Fabric continues to expand the OneLake security surface with new granular REST API support for role management, giving developers and platform teams far more control over how security policies are created, retrieved, and managed programmatically. In addition to the existing batch role API, Fabric now offers discrete Create, Get, and Delete role APIs, making it easier to build incremental, automation-friendly security workflows that align with modern DevOps and governance practices.

Click through for a quick explanation of how things did work and how they will work going forward.


Hosting an ML Model with FastAPI

Kanwal Mehreen hosts a model:

In this article, you will learn how to package a trained machine learning model behind a clean, well-validated HTTP API using FastAPI, from training to local testing and basic production hardening.

Topics we will cover include:

  • Training, saving, and loading a scikit-learn pipeline for inference
  • Building a FastAPI app with strict input validation via Pydantic
  • Exposing, testing, and hardening a prediction endpoint with health checks

Let’s explore these techniques. 

I definitely enjoy how simple it is to use FastAPI.


Misconceptions about Microsoft Certification Exams

Greg Low clears the air:

Over the years, I’ve taken a lot of Microsoft exams. I’ve also spent a lot of time writing exams for Microsoft exam providers. And while I’ve been doing that, I’ve spent a lot of time in forums where I’ve been checking out what people say about the exams. 

What amazes me is the number of misconceptions that people have about these exams. So, I thought it would be helpful to write about the most common ones. Unlike what I see (but shouldn’t see) in the forums, I can’t discuss specific questions, but the majority of this is unrelated to the actual questions or the specific exams. 

Read on to learn more. One thing Greg touches on en passant is quickly-updating information. This is one of the trickiest parts of Microsoft exams, especially in certain fields like AI: sometimes you’ll find a question that was written two versions of a product ago (i.e., 6 months ago) and now you have to guess whether you give the answer that is correct today or the answer that was correct then. I know they try to keep these exams up to date, but it’s hard to do against a moving target.


Improvements to Microsoft Fabric Real-Time Dashboards

Michal Bar makes an announcement:

Performance matters—especially when you’re exploring live data and making decisions in real time. We have released a set of improvements, all aiming to make Real-Time Dashboards faster, smoother, and more responsive, based directly on what our customers and community told us.

Read on to see what has changed.

What hasn’t changed is my complaint about the term “real-time.” But let’s be honest: I realize it’s a war I’m not going to win.


Checking SQL Server Availability Groups

Jeff Iannucci announces a new procedure:

SQL Server Availability Groups can be a great feature to help support your High Availability needs, but what happens when they fail to work as expected?

Do you have an expiring certificate used by an endpoint? Do you have timeout settings that could contribute to unexpected failovers? Are you suffering from a high number of HADR_SYNC_COMMIT waits?

We’ve seen all those things happen, and like Marvin Gaye we’ve wondered: what’s going on? And we’ve wanted a tool to help us see if other clients were having these problems, and more.

Read on for more information and check it out yourself.
