
Category: Python

Dataclasses in Python

Evan Seabrook takes us through a Python library:

If you’re really lucky, there will be a docstring for this function that outlines the structure of the parameter user, saving you from having to dig through the function and identify the possible keys that exist in parameter user.

The problem here is twofold:

1. Dictionaries in Python are mutable and can have arbitrary schemas.

a. This in itself isn’t a problem and can be a good thing, depending on your needs. Its usage, however, is really only enabled by the quality of the second point, which is:

2. You must rely on the documentation to know the structure, and the documentation must stay updated as the structure evolves.

Read on to see how a dataclass, from Python’s dataclasses module, can wrap dictionary objects in an explicit schema.
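As a minimal sketch of the idea, assuming a hypothetical `User` with `name`, `email`, and `is_active` fields (none of these names come from the linked post), a dataclass can replace the undocumented `user` dictionary so that the function signature itself describes the structure:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class User:
    """Explicit schema for what would otherwise be an undocumented dict.

    The field names here are illustrative assumptions.
    """
    name: str
    email: str
    is_active: bool = True


def greeting(user: User) -> str:
    # The type hint tells the caller exactly which fields are available.
    return f"Hello, {user.name} <{user.email}>"


# Build the dataclass from an existing dictionary.
# Missing or unexpected keys raise TypeError instead of failing later.
raw = {"name": "Ada", "email": "ada@example.com"}
user = User(**raw)
message = greeting(user)

# Convert back to a plain dictionary when one is needed.
round_trip = asdict(user)
```

Setting `frozen=True` is a design choice rather than a requirement: it makes instances immutable, which addresses the mutability concern in the first point of the quote.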


TensorFlow Fundamentals

Tanishka Garg starts a series on TensorFlow:

TensorFlow is an open-source end-to-end machine learning library. It is for preprocessing data, modeling data, and serving models (getting them into the hands of others).

It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.

Read on for basic setup instructions and a primer on tensors.


pyspark.pandas in Apache Spark 3.2

Hyukjin Kwon and Xinrong Meng announce a built-in pandas API for Apache Spark 3.2:

We’re thrilled to announce the pandas API as part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users can leverage the pandas API on their existing Spark clusters.

A few years ago, we launched Koalas, an open source project that implements the pandas DataFrame API on top of Spark, which became widely adopted among data scientists. Recently, Koalas was officially merged into PySpark by SPIP: Support pandas API layer on PySpark as part of Project Zen (see also Project Zen: Making Data Science Easier in PySpark from Data + AI Summit 2021).

pandas users can now scale their workloads with one simple line change in the upcoming Spark 3.2 release:

Click through to see more details on the change.


SCD Type 2 with Delta Lake

Chris Williams continues a series on slowly changing dimensions in Delta Lake:

Type 2 SCD is probably one of the most common examples to easily preserve history in a dimension table and is commonly used throughout any Data Warehousing/Modelling architecture. Active rows can be indicated with a boolean flag or a start and end date. In this example from the table above, all active rows can be displayed simply by returning a query where the end date is null.

Read on to see how you can implement this pattern using Delta Lake’s capabilities.
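To make the pattern concrete, here is a small in-memory sketch in plain Python. It is not Delta Lake code, and the `CustomerRow` fields and helper names are hypothetical. It only illustrates the Type 2 rule from the quote: a change expires the active row by setting its end date and appends a new active row, so active rows are the ones whose end date is null.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    # Hypothetical dimension columns, chosen for illustration only.
    customer_id: int
    city: str
    start_date: date
    end_date: Optional[date] = None  # None marks the active version


def apply_change(rows: List[CustomerRow], customer_id: int,
                 new_city: str, as_of: date) -> List[CustomerRow]:
    """Expire the active row for customer_id and append a new active row."""
    updated = []
    for row in rows:
        if row.customer_id == customer_id and row.end_date is None:
            if row.city == new_city:
                return list(rows)  # nothing changed, keep history as-is
            row = replace(row, end_date=as_of)  # close the old version
        updated.append(row)
    updated.append(CustomerRow(customer_id, new_city, start_date=as_of))
    return updated


def active_rows(rows: List[CustomerRow]) -> List[CustomerRow]:
    # Same idea as filtering the dimension table on "end date is null".
    return [row for row in rows if row.end_date is None]


history = [CustomerRow(1, "Leeds", date(2020, 1, 1))]
history = apply_change(history, 1, "York", date(2021, 6, 1))
current = active_rows(history)
```

After the change, `history` holds both versions of customer 1, and `current` holds only the York row.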


What is Pandas?

Lina Kovacheva starts a new series on Pandas:

First and foremost – what is Pandas?

Pandas is a popular Python library that allows users to easily analyse and manipulate data. It offers powerful and flexible data structures and is vastly popular among data scientists and analysts. As with any other library, you have to import Pandas before you can use it.

Click through to learn more.


Databricks Autologging

Corey Zumar and Kasey Uhlenhuth announce a new product:

Machine learning teams require the ability to reproduce and explain their results–whether for regulatory, debugging or other purposes. This means every production model must have a record of its lineage and performance characteristics. While some ML practitioners diligently version their source code, hyperparameters and performance metrics, others find it cumbersome or distracting from their rapid prototyping. As a result, data teams encounter three primary challenges when recording this information: (1) standardizing machine learning artifacts tracked across ML teams, (2) ensuring reproducibility and auditability across a diverse set of ML problems and (3) maintaining readable code across many logging calls.

Read on to see how Databricks Autologging can address these challenges.


Projecting Disk Space Available

Constantine Kokkinos predicts the future:

The first question I wanted to model out was a bigger issue with on-premises databases – when are we going to run out of storage?

Back in the day I’d cheat with msdb backups, comparing compressed sizes to actuals, and moving on. However, I don’t have a historical reference for Stack Overflow… so what can I do?

Taking a look at the tables, we see a commonality in many of them – CreationDate! It looks like the rows are faithfully stamped when they are created.

At the end, Constantine hits on something we tend to forget: most operations in life aren’t quite linear. We often get lucky in that certain stretches are close enough to linear that we can model them that way, but even in this dataset, you can see the effects of polynomial growth slowly build up. Still, this is a good walkthrough of what an analysis and projection can look like.
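For a rough feel of what such a projection involves, the sketch below fits a straight line to cumulative size by month and estimates how many months remain before a capacity limit is reached. It is not the method from the linked post: the function names, the fixed `bytes_per_row`, and the sample numbers are all assumptions, and the linear fit will underestimate growth in exactly the non-linear cases mentioned above.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_full(monthly_new_rows: List[int], bytes_per_row: int,
                      capacity_bytes: int) -> Optional[float]:
    """Project months remaining until capacity, assuming linear growth.

    monthly_new_rows would come from counting rows per CreationDate month.
    """
    sizes = []
    total = 0
    for rows in monthly_new_rows:
        total += rows * bytes_per_row  # cumulative size at month end
        sizes.append(float(total))
    months = [float(m) for m in range(1, len(sizes) + 1)]
    slope, intercept = fit_line(months, sizes)
    if slope <= 0:
        return None  # no growth, so no projected fill date
    full_at = (capacity_bytes - intercept) / slope
    return max(0.0, full_at - months[-1])


# Hypothetical sample: 100 new rows a month at 10 bytes each, 10,000 byte limit.
remaining = months_until_full([100, 100, 100, 100], 10, 10_000)
```

With steady growth like this sample, the estimate is exact. With accelerating growth, a higher-order fit or a fit over only the most recent months would be a more cautious choice.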


Django Support for SQL Server

Warren Chu announces a 1.0 version of a new product:

We’re officially announcing the release of mssql-django v1.0 as an open source project!

At Microsoft we’ve heard from the community loud and clear – SQL Server is the biggest enterprise backend not yet fully supported in Django.

That’s about to change.

This project picks up where previous open source projects have left off. We began with a series of preview releases in February 2021, and we’re pleased to officially bring Microsoft support to SQL Server and Azure SQL DB with this version’s official release.

Django is still a fairly popular platform, so I’m happy to see this released.


Importing SQL Server Extended Properties into Azure Purview

Daniel Janik shows how you can use PyApacheAtlas to move specific SQL Server extended properties into Azure Purview:

This post is going to be restricted to only SQL Server Table Columns and only Extended Properties named MS_Description. Quite a few years ago I worked on a data catalog project where we added descriptions for many of the tables, views, and columns to the database using extended properties named MS_Description. Let’s assume you have some of these for this post, keeping in mind that the Purview APIs provide many functions beyond what this post covers and that the code here could be modified to do much more as well.

Starting out I thought it would be great to import the sensitivity classifications that SSMS creates. Pre-SQL 2019 these were held in Extended Properties and now have their very own DMV (sys.sensitivity_classifications). While this sounded great in theory it wasn’t as exciting when I wrote the code. This is because Azure Purview already has system classifications at a more granular scale for each of the ones you find in SSMS and Purview also adds these as it executes a scan on the data source. It does a pretty good job too. With that said, I shifted my focus to adding descriptions instead.

Read on to see how you can do this.


Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Juston Leto, et al., show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.
