Press "Enter" to skip to content

Author: Kevin Feasel

ML Services: PYTHONHOME and PATH

Niels Berglund troubleshoots some issues:

In the last post, which looks at using Python 3.9 in SQL Server Machine Learning Services, I wrote this at the very end:

It looks like all is good, but maybe not? In a future post we’ll look at an issue we have introduced – but for now, let us bask in the glory of having created a new Python language extension.

In the post, we wrote a new language extension to handle Python 3.9, and that worked just fine. However, when I was doing some other things, I noticed some side effects, and in this post, we look at those side effects and how to solve them.

Click through to learn more.
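
As a quick aside, PYTHONHOME and PATH control which runtime an embedded interpreter picks up, so a mismatch there is a classic source of side effects. A generic diagnostic sketch (mine, not from Niels's post) is to print those values from inside the session in question:

    import os
    import sys

    # Which interpreter is actually running, and where does it think home is?
    print("executable:", sys.executable)
    print("prefix:    ", sys.prefix)

    # PYTHONHOME and PATH drive how the interpreter resolves its standard
    # library and DLLs; check these against the runtime you expect.
    print("PYTHONHOME:", os.environ.get("PYTHONHOME", "<not set>"))
    print("PATH:      ", os.environ.get("PATH", ""))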

Comments closed

The Nature of Overfitting

John Mount has a nice essay on overfitting:

What is meant by “overfitting” is: the estimated f() will tend to show off or over perform on the data used to fit, train, or construct it. I have some notes on this sort of selection bias here: https://win-vector.com/2020/12/10/overfit-and-reversion-to-mediocrity-the-bane-of-data-science/.

Selecting a model that “looks good” is enough to bias the model’s evaluation with respect to the data set we said it “looked good” on. So even when using unbiased methods, the data scientist can introduce bias by choosing to use one model (say the one fit by logistic regression) over another (say using an observed prevalence everywhere as a probability prediction).

The way I talk about overfitting is to say that we’ve trained a model which latches onto the particulars of the training data set. To the extent that the particulars of the training data set are matched by the broader world, that’s “fitting.” To the extent that those particulars are unique to that data set and are not generally applicable, that’s “overfitting.” I generally don’t have space to get deeper into what this means, but John dives into the topic in an accessible way.
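
To make the distinction concrete, here is a minimal sketch (my own, not John’s) of a model latching onto its training data: an unconstrained decision tree will typically score perfectly on the data it was fit on and noticeably worse on held-out data.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic classification data, split into train and test sets.
    X, y = make_classification(n_samples=500, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # An unconstrained tree can memorize the training set.
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)

    print(f"train accuracy: {model.score(X_train, y_train):.2f}")  # typically 1.00
    print(f"test accuracy:  {model.score(X_test, y_test):.2f}")    # noticeably lower

The gap between those two scores is the overfit: the model’s apparent skill on its own training data does not generalize.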

Comments closed

Running an mlflow Server on Azure

Paul Hernandez configures mlflow on Azure using platform-as-a-service offerings:

It is indisputably true that mlflow came to make life a lot easier, not only for data scientists but also for data engineers and architects, among others. There is a very helpful list of tutorials and examples in the official mlflow docs. You can just download it, open a console, and start using it locally on your computer. This is the fastest way to get started. However, as soon as you progress and introduce mlflow in your team, or you want to use it extensively for yourself, some components should be deployed outside your laptop.

To exercise a deployment setup, and since I have Azure experience, I decided to provision a couple of resources in the cloud to deploy the model registry and store the data produced by the tracking server.

I concur on the power of mlflow.
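
For flavor, here is a minimal sketch of what using a remote tracking server looks like from the client side; the URI is a placeholder for wherever you host mlflow, and authentication setup is out of scope here:

    import mlflow

    # Point the client at a remote tracking server (placeholder URI).
    mlflow.set_tracking_uri("http://your-mlflow-server:5000")
    mlflow.set_experiment("azure-demo")

    # Log a parameter and a metric against that server.
    with mlflow.start_run():
        mlflow.log_param("alpha", 0.5)
        mlflow.log_metric("rmse", 0.78)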

Comments closed

Extracting Refresh Metrics for a Power BI Workspace

Marc Lelijveld wants to pull some metrics:

In the Power BI service, you can easily look at refresh times for an individual dataset or dataflow. There are many different reasons why these metrics are important to you as a dataset or dataflow owner. For example, you may bump into refresh time-outs or unfortunate errors. There are many good reasons to want more insight into your refresh metrics.

That said, it can be a pain to look at these metrics every day. Power BI already offers a way to send automatic notifications in case of a refresh failure, though I would personally prefer to have more insight into all my refresh metrics, whether they are failing or succeeding.

In this blog post, I want to share how you can export all refresh metrics for your datasets and dataflows using a PowerShell script.

Click through to see how and to get a copy of the script.
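
Marc’s script uses PowerShell; the same refresh history is also available from the documented Power BI REST API, so as a rough sketch, here is the equivalent call in Python. The workspace ID, dataset ID, and token are placeholders (you would acquire the token via Azure AD, for example with MSAL):

    import requests

    group_id = "<workspace-id>"      # placeholder
    dataset_id = "<dataset-id>"      # placeholder
    access_token = "<bearer-token>"  # acquire via Azure AD / MSAL

    # Get Refresh History endpoint for a dataset in a workspace.
    url = (f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
           f"/datasets/{dataset_id}/refreshes")
    resp = requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
    resp.raise_for_status()

    for refresh in resp.json()["value"]:
        print(refresh["startTime"], refresh["endTime"], refresh["status"])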

Comments closed

Transforming Arrays in Azure Data Factory

Mark Kromer shows off a few functions in Azure Data Factory to modify data in arrays:

The first transformation function is map(), which allows you to apply data flow scalar functions as its second parameter. In my case, I use upper() to uppercase every element in my string array: map(columnNames(), upper(#item))

Read on for more iteration and aggregation functions akin to map, reduce, and filter.
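
The expression above is ADF’s data flow expression language, so you cannot run it locally. Purely as a conceptual analogue, here is the same map / filter / reduce pattern in Python:

    from functools import reduce

    column_names = ["customer_id", "order_date", "amount"]

    # map: apply a function to every element (cf. map(columnNames(), upper(#item))).
    upper_cased = list(map(str.upper, column_names))

    # filter: keep only elements matching a predicate.
    short_names = list(filter(lambda c: len(c) <= 10, column_names))

    # reduce: fold the array down to a single value.
    concatenated = reduce(lambda acc, c: acc + "," + c, column_names)

    print(upper_cased)   # ['CUSTOMER_ID', 'ORDER_DATE', 'AMOUNT']
    print(short_names)   # ['order_date', 'amount']
    print(concatenated)  # customer_id,order_date,amount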

Comments closed

String Modification in T-SQL

Steve Jones answers a question:

Recently I ran across a question posted by a beginner on the Internet and thought this would be a good, basic topic to cover. The question was: how can I replace a value in a comma-separated string in a table?

This post covers the basics of this task.

Incidentally, this is where I say hey, that sounds like a failure in normalization. If you need to care about individual values in a collection, your value is not atomic. But that’s a bit of a tangent.
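
For illustration only (Steve’s post does this in T-SQL), the underlying operation is: split on the delimiter, swap the target value, and rejoin.

    def replace_in_csv(csv: str, old: str, new: str) -> str:
        # Split on commas, replace exact matches only, and glue back together.
        parts = [new if part == old else part for part in csv.split(",")]
        return ",".join(parts)

    print(replace_in_csv("red,green,blue", "green", "yellow"))  # red,yellow,blue

Note that a naive string replace would also match substrings inside other values, which is exactly the kind of subtlety that makes non-atomic columns painful.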

Comments closed

When to Use Event Sourcing

Vikas Hazrati takes us through the pros and cons of using event sourcing for a project:

You would always get a ton of literature on Event Sourcing and CQRS. The key question is WHEN do you use it? Under what circumstances? Is your problem really in need of ES?

I would not go into the details of what Event Sourcing and CQRS are. The industry stalwarts have covered that in adequate detail. This post delves into battle-tested scenarios covering where we should have used ES and where we were better off ignoring it.

Click through for an analysis of pros and cons, as well as some advice on what it all means.
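
For readers new to the pattern, here is a minimal illustrative sketch (the names are mine, not from the post): in event sourcing, you never store current state directly; you store an append-only log of events and derive state by replaying it.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Deposited:
        amount: int

    @dataclass(frozen=True)
    class Withdrew:
        amount: int

    def apply(balance: int, event) -> int:
        # Each event type knows how it changes state.
        if isinstance(event, Deposited):
            return balance + event.amount
        if isinstance(event, Withdrew):
            return balance - event.amount
        return balance

    # The log is the source of truth; current state is a fold over it.
    log = [Deposited(100), Withdrew(30), Deposited(5)]

    balance = 0
    for event in log:
        balance = apply(balance, event)

    print(balance)  # 75

The upside is a full audit trail and the ability to rebuild state as of any point in time; the downside, as the post discusses, is the extra machinery this demands.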

Comments closed

Configuring a Linked Server to Oracle

Emanuele Meazzo needs to pull data from Oracle into SQL Server:

The most atrocious part of my search for glory was without doubt navigating all the packages to download and install for each component; between broken links and differences between the instructions and the actual content, it’s a mess.

It took a while, based on Emanuele’s tone. With SQL Server 2019, you can avoid some of this pain by using PolyBase. But for prior versions of SQL Server, your options are more limited.

Comments closed