Press "Enter" to skip to content

Category: Python

pyspark.pandas in Apache Spark 3.2

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.

A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).

pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:
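The snippet itself didn’t survive the excerpt, but the one-line change in question is the import swap, trading plain pandas for the new pyspark.pandas module. A minimal sketch, with a hypothetical file path:

```python
# import pandas as pd          # classic single-machine pandas
import pyspark.pandas as ps    # pandas API on Spark, new in Spark 3.2

# The familiar pandas surface now executes distributed on the Spark cluster
psdf = ps.read_csv("my_data.csv")   # hypothetical path
print(psdf.describe())
```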

Click through to see more details on the change.


SCD Type 2 with Delta Lake

Chris Williams continues a series on slowly changing dimensions in Delta Lake:

Type 2 SCD is probably the most common pattern for preserving history in a dimension table and is used throughout Data Warehousing/Modelling architectures. Active rows can be indicated with a boolean flag or a start and end date. In the example table from the original post, all active rows can be displayed simply by querying for rows where the end date is null.

Read on to see how you can implement this pattern using Delta Lake’s capabilities.
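This isn’t Chris’s exact code, but a minimal sketch of the two-step pattern using Delta Lake’s Python API, assuming a SparkSession named spark, an updates DataFrame of incoming rows, and a dimension keyed on customer_id with start_date/end_date columns (all names hypothetical):

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

dim_path = "/mnt/delta/dim_customer"   # hypothetical location
dim = DeltaTable.forPath(spark, dim_path)

# Step 1: expire the currently active versions of any changed keys
dim.alias("t").merge(
    updates.alias("s"),
    "t.customer_id = s.customer_id AND t.end_date IS NULL"
).whenMatchedUpdate(
    set={"end_date": "current_date()"}
).execute()

# Step 2: append the incoming rows as the new active versions (end_date stays null)
(updates
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save(dim_path))
```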


What is Pandas?

Lina Kovacheva starts a new series on Pandas:

First and foremost – what is Pandas?

Pandas is a popular Python library that allows users to easily analyse and manipulate data. It offers powerful and flexible data structures and is widely used among data scientists and analysts. As with any other library, to use Pandas you first have to import it.

Click through to learn more.
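For anyone brand new to the library, the import (with its conventional alias) and a tiny DataFrame look like this:

```python
import pandas as pd   # pd is the conventional alias

df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [95, 98]})
print(df.describe())           # quick summary statistics
print(df[df["score"] > 96])    # boolean filtering
```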


Databricks Autologging

Corey Zumar and Kasey Uhlenhuth announce a new product:

Machine learning teams require the ability to reproduce and explain their results–whether for regulatory, debugging or other purposes. This means every production model must have a record of its lineage and performance characteristics. While some ML practitioners diligently version their source code, hyperparameters and performance metrics, others find it cumbersome or distracting from their rapid prototyping. As a result, data teams encounter three primary challenges when recording this information: (1) standardizing machine learning artifacts tracked across ML teams, (2) ensuring reproducibility and auditability across a diverse set of ML problems and (3) maintaining readable code across many logging calls.

Read on to see how Databricks Autologging addresses these challenges.
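Databricks Autologging builds on MLflow’s autologging, where a single call replaces a pile of explicit log_param()/log_metric() calls; per the announcement, on Databricks it runs without even that call. A minimal sketch of the open-source equivalent, with scikit-learn as the stand-in framework:

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

mlflow.autolog()   # one call instead of many explicit logging calls

X, y = load_diabetes(return_X_y=True)
with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X, y)   # params, metrics, and the model artifact are recorded automatically
```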


Projecting Disk Space Available

Constantine Kokkinos predicts the future:

The first question I wanted to model out was a bigger issue with on-premises databases – when are we going to run out of storage?

Back in the day I’d cheat with msdb backups, comparing compressed sizes to actuals, and moving on. However, I don’t have a historical reference for Stack Overflow… so what can I do?

Taking a look at the tables, we see a commonality across many of them – CreationDate! It looks like rows are faithfully stamped when they are created.

At the end, Constantine hits on something we tend to forget: most operations in life aren’t quite linear. We often get lucky in that certain stretches are close enough to linear that we can model them that way, but even in this dataset, you can see the effects of polynomial growth slowly build up. Still, this is a good way of taking us through what an analysis and projection can look like.
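As a sketch of what that kind of projection can look like, here is a linear-versus-quadratic fit over monthly row counts; the numbers are invented stand-ins for counts you would derive from CreationDate:

```python
import numpy as np

# Hypothetical: cumulative row counts per month, derived from CreationDate
months = np.arange(24)
rows = 1_000_000 + 50_000 * months + 1_500 * months ** 2  # mild polynomial growth

# Fit linear and quadratic models, then extrapolate a year out
linear = np.polyfit(months, rows, deg=1)
quad = np.polyfit(months, rows, deg=2)
future = np.arange(24, 36)

print("12-month linear forecast:   ", int(np.polyval(linear, future[-1])))
print("12-month quadratic forecast:", int(np.polyval(quad, future[-1])))
```

On even mildly curved data the linear model undershoots – exactly the slow build-up of polynomial growth mentioned above.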


Django Support for SQL Server

Warren Chu announces a 1.0 version of a new product:

We’re officially announcing the release of mssql-django v1.0 as an open source project!

At Microsoft we’ve heard from the community loud and clear – SQL Server is the biggest enterprise backend not yet fully supported in Django.

That’s about to change.

This project picks up where previous open source projects have left off. We began with a series of preview releases in February 2021, and we’re pleased to bring official Microsoft support to SQL Server and Azure SQL DB with this release.

Django is still a fairly popular platform, so I’m happy to see this released.
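Wiring Django up to SQL Server with the new backend follows the usual DATABASES pattern in settings.py; a sketch along the lines of the project’s documentation, with hypothetical server and credential values:

```python
# settings.py
DATABASES = {
    "default": {
        "ENGINE": "mssql",                  # the backend mssql-django provides
        "NAME": "mydatabase",               # hypothetical database name
        "USER": "myuser",
        "PASSWORD": "mypassword",
        "HOST": "myserver.example.com",
        "PORT": "1433",
        "OPTIONS": {
            "driver": "ODBC Driver 17 for SQL Server",
        },
    }
}
```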


Importing SQL Server Extended Properties into Azure Purview

Daniel Janik shows how you can use PyApacheAtlas to move specific SQL Server extended properties into Azure Purview:

This post is going to be restricted to SQL Server table columns and to extended properties named MS_Description. Quite a few years ago I worked on a data catalog project where we added descriptions for many of the tables, views, and columns to the database using extended properties named MS_Description. Let’s assume you have some of these for this post, keeping in mind that the Purview APIs provide many functions beyond what this post covers and that the code here could be modified to do much more.

Starting out, I thought it would be great to import the sensitivity classifications that SSMS creates. Pre-SQL Server 2019, these were held in extended properties; they now have their very own DMV (sys.sensitivity_classifications). While this sounded great in theory, it wasn’t as exciting when I wrote the code. This is because Azure Purview already has system classifications at a more granular scale for each of the ones you find in SSMS, and Purview also adds these as it executes a scan on the data source. It does a pretty good job, too. With that said, I shifted my focus to adding descriptions instead.

Read on to see how you can do this.
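The extraction half of that job – pulling column-level MS_Description values out of SQL Server – can look something like the sketch below, using pyodbc with hypothetical connection details; the results would then be pushed into Purview via PyApacheAtlas:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"   # hypothetical
)

# Column-level extended properties named MS_Description
sql = """
SELECT s.name AS schema_name, t.name AS table_name,
       c.name AS column_name, CAST(ep.value AS nvarchar(max)) AS description
FROM sys.extended_properties AS ep
JOIN sys.tables  AS t ON ep.major_id = t.object_id
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
JOIN sys.columns AS c ON ep.major_id = c.object_id
                     AND ep.minor_id = c.column_id
WHERE ep.name = 'MS_Description'
  AND ep.class = 1       -- object or column
  AND ep.minor_id > 0    -- column-level only
"""

for row in conn.cursor().execute(sql):
    print(row.schema_name, row.table_name, row.column_name, row.description)
```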


Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Juston Leto, et al., show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.
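On the Airflow side, a pipeline like this reduces to a DAG of tasks; Amazon MWAA runs standard Apache Airflow, so the DAG file is ordinary Airflow code. A minimal sketch with stubbed task bodies and hypothetical names:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def preprocess():
    # pull raw data, engineer features, write them somewhere durable (e.g., S3)
    pass

def train():
    # fit an XGBoost model on the prepared features
    pass

with DAG(
    dag_id="xgboost_pipeline",        # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    preprocess_task = PythonOperator(task_id="preprocess", python_callable=preprocess)
    train_task = PythonOperator(task_id="train", python_callable=train)
    preprocess_task >> train_task   # train runs only after preprocessing succeeds
```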


Explaining an ML Model with SHAP

Dan Lantos, et al., walk us through one technique for model explainability:

Interpretability has to do with how accurately a machine learning model can associate a cause (input) with an effect (output).

Explainability on the other hand is the extent to which the internal mechanics of a machine or deep learning system can be explained in human terms. Or to put it simply, explainability is the ability to explain what is happening. 

Let’s consider a simple example illustrated below, where the goal of the machine learning model is to classify an animal into its respective group. We use an image of a butterfly as input to the machine learning model. The model would classify the butterfly as an insect, mammal, fish, reptile, or bird. Typically, most complex machine learning models would provide a classification without explaining how the features contributed to the result. However, using tools that help with explainability, we can overcome this limitation. We can then understand which particular features of the butterfly contributed to it being classified as an insect: since the butterfly has six legs, it is classified as an insect.

Being able to provide a rationale behind a model’s prediction would give the users (and the developers) confidence about the validity of the model’s decision.

Read on to see how you can use a library called SHAP in Python to help with this explainability.
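As a taste of the library, here is a minimal sketch with a stand-in tree model and dataset; TreeExplainer is SHAP’s fast path for tree ensembles:

```python
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)    # exact, fast SHAP values for trees
shap_values = explainer.shap_values(X)

# Plot which features push predictions up or down across the dataset
shap.summary_plot(shap_values, X)
```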


Reinforcement Learning and Python 3

I have a new post up:

I finally got around to trying out a reinforcement learning exercise this weekend in an attempt to learn about the technique. One of the most interesting blog posts I read is Andrej Karpathy’s post on using reinforcement learning to play Pong on the Atari 2600. In it, Andrej uses the Gym package in Python to play the game.

This won’t be a post diving into the details of how reinforcement learning works; Andrej does that far better than I possibly could, so read the post. Instead, the purpose of this post is to provide a minor update to Andrej’s code to switch it from Python 2 to Python 3. In doing this, I went with the most convenient answer over a potentially better solution (e.g., switching xrange() to range() rather than re-working the code), but it does work. I also bumped up the learning rate a bit to pick up the pace.

Click through for the (slightly) updated code.
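For those who haven’t ported Python 2 code before, the flavor of the change is small; a representative fragment (not the actual Pong code):

```python
batch_size = 10

# Python 2: xrange() returned a lazy sequence and is gone in Python 3.
# for episode in xrange(batch_size): ...

# Python 3: range() is already lazy, so the rename is a drop-in fix.
for episode in range(batch_size):
    print(episode)   # print is a function in Python 3, not a statement
```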
