Category: Python

Databricks Autologging

Published 2021-08-31 by Kevin Feasel

Corey Zumar and Kasey Uhlenhuth announce a new product:

Machine learning teams require the ability to reproduce and explain their results–whether for regulatory, debugging or other purposes. This means every production model must have a record of its lineage and performance characteristics. While some ML practitioners diligently version their source code, hyperparameters and performance metrics, others find it cumbersome or distracting from their rapid prototyping. As a result, data teams encounter three primary challenges when recording this information: (1) standardizing machine learning artifacts tracked across ML teams, (2) ensuring reproducibility and auditability across a diverse set of ML problems and (3) maintaining readable code across many logging calls.

Read on to see how Databricks Autologging addresses these challenges.

Comments closed

Projecting Disk Space Available

Published 2021-08-20 by Kevin Feasel

Constantine Kokkinos predicts the future:

The first question I wanted to model out was a bigger issue with on-premises databases – when are we going to run out of storage?
Back in the day I’d cheat with msdb backups, comparing compressed sizes to actuals, and moving on. However I don’t have a historical reference for Stack Overflow… so what can I do?
Taking a look at the tables we see a commonality in many tables – CreationDate! It looks like the rows faithfully are stamped when they are created.

Constantine does at the end hit on something we tend to forget: most operations in life aren’t quite linear. We often get lucky in that certain stretches are close enough to be linear that we can model them that way, but even in this dataset, you can see the effects of polynomial growth slowly build up. Still, this is a good way of taking us through what an analysis and projection can look like.
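
As a rough illustration of the linear version of this idea, here is a minimal sketch that fits a least-squares line to cumulative size over time and solves for the month in which the line reaches a capacity limit. The monthly figures, the 500 GB capacity, and the function names are all invented for this example; none of them come from Constantine's post.

```python
# Hypothetical cumulative database size in GB at the end of each month.
# These numbers are invented for illustration; they are not from the post.
months = [0, 1, 2, 3, 4, 5]
size_gb = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def months_until_full(xs, ys, capacity_gb):
    """Month index at which the fitted line reaches capacity_gb."""
    intercept, slope = fit_line(xs, ys)
    if slope <= 0:
        return None  # no growth, so the line never reaches capacity
    return (capacity_gb - intercept) / slope


# Assumed capacity of 500 GB, purely for the example.
projected_month = months_until_full(months, size_gb, capacity_gb=500.0)
```

If growth is actually accelerating, a straight line like this will give an overly optimistic date, which is the caveat above about polynomial growth.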

Django Support for SQL Server

Published 2021-08-18 by Kevin Feasel

Warren Chu announces a 1.0 version of a new product:

We’re officially announcing the release of mssql-django v1.0 as an open source project!
At Microsoft we’ve heard from the community loud and clear – SQL Server is the biggest enterprise backend not yet fully supported in Django.
That’s about to change.
This project picks up where previous open source projects have left off. We began with a series of preview releases in February 2021, and we’re pleased to officially bring Microsoft support to SQL Server and Azure SQL DB with this version’s official release.

Django is still a fairly popular platform, so I’m happy to see this released.
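
For a sense of what this looks like in practice, here is a sketch of a Django `settings.py` fragment pointing at SQL Server. To the best of my knowledge, the `"mssql"` engine name and the `driver` option match the mssql-django README, but treat them as assumptions and check the project documentation. Every connection value below is a placeholder.

```python
# settings.py fragment: a sketch with placeholder values only.
# The engine name and "driver" option are assumed from the mssql-django
# README; verify against the project's documentation before relying on them.
DATABASES = {
    "default": {
        "ENGINE": "mssql",
        "NAME": "my_database",        # placeholder database name
        "USER": "my_user",            # placeholder login
        "PASSWORD": "my_password",    # placeholder; load from a secret store
        "HOST": "localhost",          # placeholder server address
        "PORT": "1433",               # default SQL Server port
        "OPTIONS": {
            "driver": "ODBC Driver 17 for SQL Server",
        },
    },
}
```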

Importing SQL Server Extended Properties into Azure Purview

Published 2021-08-12 by Kevin Feasel

Daniel Janik shows how you can use PyApacheAtlas to move specific SQL Server extended properties into Azure Purview:

This post is going to be restricted to only SQL Server Table Columns and only Extended Properties named MS_Description. Quite a few years ago I worked on a data catalog project where we added descriptions for many of the tables, views, and columns to the database using extended properties named MS_Description. Let’s assume you have some of these for this post keeping in mind that the Purview APIs provide so many functions beyond what this post covers and that the code here could be modified to do so much more as well.
Starting out I thought it would be great to import the sensitivity classifications that SSMS creates. Pre-SQL 2019 these were held in Extended Properties and now have their very own DMV (sys.sensitivity_classifications). While this sounded great in theory it wasn’t as exciting when I wrote the code. This is because Azure Purview already has system classifications at a more granular scale for each of the ones you find in SSMS and Purview also adds these as it executes a scan on the data source. It does a pretty good job too. With that said, I shifted my focus to adding descriptions instead.

Read on to see how you can do this.

Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Published 2021-08-03 by Kevin Feasel

Juston Leto, et al., show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.
This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.

Explaining an ML Model with SHAP

Published 2021-07-29 by Kevin Feasel

Dan Lantos, et al., walk us through one technique for model explainability:

Interpretability has to do with how accurately a machine learning model can associate a cause (input) to an effect (output).
Explainability on the other hand is the extent to which the internal mechanics of a machine or deep learning system can be explained in human terms. Or to put it simply, explainability is the ability to explain what is happening.
Let’s consider a simple example illustrated below where the goal of the machine learning model is to classify an animal into its respective groups. We use an image of a butterfly as input into the machine learning model. The model would classify the butterfly as either an insect, mammal, fish, reptile or bird. Typically, most complex machine learning models would provide a classification without explaining how the features contributed to the result. However, using tools that help with explainability, we can overcome this limitation. We can then understand what particular features of the butterfly contributed to it being classified as an insect. Since the butterfly has six legs, it is thus classified as an insect.
Being able to provide a rationale behind a model’s prediction would give the users (and the developers) confidence about the validity of the model’s decision.

Read on to see how you can use a library called SHAP in Python to help with this explainability.
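
SHAP builds on Shapley values from cooperative game theory, which assign each feature its average marginal contribution to a prediction. The sketch below computes exact Shapley values by brute force for a toy "is it an insect?" score. It does not use the SHAP library, and the feature names, weights, and functions are all invented for illustration.

```python
from itertools import permutations

# Hypothetical toy "model": a score for the class "insect" built from three
# binary features. The weights are invented for illustration.
FEATURES = ["six_legs", "wings", "antennae"]


def insect_score(present):
    """Return a score given the set of features that are present."""
    score = 0.0
    if "six_legs" in present:
        score += 0.6
    if "wings" in present:
        score += 0.1
    if "antennae" in present:
        score += 0.1
    # Interaction: wings and antennae together add a little extra evidence.
    if "wings" in present and "antennae" in present:
        score += 0.2
    return score


def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        present = set()
        for feature in ordering:
            before = value_fn(present)
            present.add(feature)
            totals[feature] += value_fn(present) - before
    return {f: total / len(orderings) for f, total in totals.items()}


contributions = shapley_values(FEATURES, insect_score)
```

The contributions sum to the difference between the score with every feature present and the score with none, and the interaction bonus is split evenly between the two features that produce it. Enumerating every ordering is only feasible for a handful of features, which is why practical tools rely on approximations and model-specific shortcuts.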

Reinforcement Learning and Python 3

Published 2021-07-21 by Kevin Feasel

I have a new post up:

I finally got around to trying out a reinforcement learning exercise this weekend in an attempt to learn about the technique. One of the most interesting blog posts I read is Andrej Karpathy’s post on using reinforcement learning to play Pong on the Atari 2600. In it, Andrej uses the Gym package in Python to play the game.
This won’t be a post diving into the details of how reinforcement learning works; Andrej does that far better than I possibly could, so read the post. Instead, the purpose of this post is to provide a minor update to Andrej’s code to switch it from Python 2 to Python 3. In doing this, I went with the most convenient answer over a potentially better solution (e.g., switching xrange() to range() rather than re-working the code), but it does work. I also bumped up the learning rate a little bit to pick up the pace a bit.

Click through for the (slightly) updated code.
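
As a rough illustration of the kind of change involved, here is a simplified pure-Python sketch of a discounted-rewards loop of the sort used in policy gradient code. It is not the code from either post: the function name, the plain-list implementation, and the sample inputs are assumptions made for this example.

```python
def discount_rewards(rewards, gamma=0.99):
    """Return discounted cumulative rewards, working backward through time."""
    discounted = [0.0] * len(rewards)
    running_total = 0.0
    # Python 2 code would typically write reversed(xrange(len(rewards))) here;
    # in Python 3, range() is already lazy, so it is a drop-in replacement.
    for t in reversed(range(len(rewards))):
        running_total = running_total * gamma + rewards[t]
        discounted[t] = running_total
    return discounted


# A reward of 1.0 at the final step, discounted by half per step backward.
example = discount_rewards([0.0, 0.0, 1.0], gamma=0.5)
```

Because Python 3's `range()` produces values on demand rather than building a list, swapping it in for `xrange()` keeps the same memory behavior.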

Contrasting Scala and Python wrt Spark

Published 2021-07-20 by Kevin Feasel

Sanjay Rathore contrasts two of the three key Apache Spark languages:

Imagine the first day of a new Apache Spark project. The project manager looks at the team and says: which one to choose, scala or python. So let’s start with “scala vs python for spark”.
You may wonder if this is a tricky question. What does the enterprise demand say? Is this like asking iOS or Android? Is there a right or wrong answer?
So we are here to inform and provide clarity. Today we’re looking at two popular programming languages, Scala and Python, and comparing them in the context of Apache Spark and Big Data in general.

Read on for the comparison. I’m at a point where I think it’s wise to know both languages and roll with whichever is already in use. If you’re in a greenfield Spark implementation, pick the one you (or your team) are more comfortable with. If you’re equally comfortable with the two, pick Scala because it’s a functional programming language and those are neat.

Time Series Estimation with Facebook’s Prophet

Published 2021-07-12 by Kevin Feasel

Dan Lantos looks at the Prophet library:

This article (part of a short series) aims to introduce the Prophet library, discuss it at a high level and run through a basic example of forecasting the FTSE 100 index. Future articles will discuss exactly how Prophet achieves its results, how to interpret the output and how to improve the model.
Please see this article (by my talented colleague Gavita) for an introduction to time-series forecasting algorithms.

Click through for part one in an ongoing series.

Optimizing BERT Models on Google Colab

Published 2021-07-08 by Kevin Feasel

Kevin Jacobs fine-tunes some NLP processes:

BERT is a language model and can thus be used for predicting the next word in a sentence. Furthermore, BERT can be used for automatic summarization, text classification and many more downstream tasks. Google Colab provides you with a cloud-based environment on which you can train your machine learning models on a GPU. The downside is that your data is uploaded to the Google cloud. Google Colab gives you the opportunity to finetune BERT.

Click through to see how.
