
Category: Python

Calculating the Size of Dataflow Gen2 Staging Lakehouses

Sandeep Pawar busts out the calculator:

My friend Alex Powers (PM, Fabric CAT) wrote a blog post about cleaning up the staging lakehouses generated by Dataflow Gen2. Before reading this blog, go ahead and read his blog first on the mechanics of it and the whys. Note that these are system-generated lakehouses, so at some time in the future they will be automatically purged, but until then the users will be paying the storage cost of these lakehouses. If you want to read more about how Dataflow Gen2 works and whether you should stage or not, read this and this blog.

Read on for a Python script using the SemPy library.
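Sandeep's script uses SemPy to enumerate workspaces and their items, which I won't reproduce here. As a rough illustration of just the aggregation step, the sketch below assumes you have already collected an item listing into a pandas DataFrame with hypothetical `workspace`, `name`, `type`, and `size_bytes` columns (the system lakehouses are named `DataflowsStagingLakehouse`):

```python
import pandas as pd

def staging_lakehouse_sizes(items: pd.DataFrame) -> pd.DataFrame:
    """Filter a Fabric item listing down to Dataflow Gen2 staging
    lakehouses and report their total size per workspace.

    `items` is assumed to have columns: workspace, name, type, size_bytes.
    """
    staging = items[
        (items["type"] == "Lakehouse")
        & items["name"].str.startswith("DataflowsStaging")
    ]
    return (
        staging.groupby("workspace", as_index=False)["size_bytes"]
        .sum()
        .assign(size_gb=lambda d: d["size_bytes"] / 1024**3)
    )
```

The column names are my own assumption for illustration; the real work of pulling sizes out of OneLake is in Sandeep's post.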


Polymorphism in Python

Rajendra Gupta talks object-orientation:

Polymorphism is a popular term in object-oriented programming (OOP) languages. An object can take multiple forms in different ways in polymorphism. For example, a woman takes different roles in her daily life, such as wife, professional, athlete, mother, and daughter, as the diagram below depicts:

Polymorphism isn’t a particularly difficult topic to understand, though because of the way that different languages implement the idea in subtly different ways, it’s good to know what you’re able to do in your language of choice.
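As a quick taste of how Python does it, both method overriding and duck typing give you polymorphic behavior. This minimal sketch (the class names are my own, not from Rajendra's post) shows a single function working against three classes that merely share a method name:

```python
class Employee:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is an employee"


class Athlete(Employee):
    # Method overriding: the subclass replaces the parent's behavior.
    def describe(self):
        return f"{self.name} is an athlete"


class Volunteer:
    # Duck typing: no shared base class, just the same method name.
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a volunteer"


def introduce(person):
    # Polymorphic call: works for any object exposing describe().
    return person.describe()
```

Calling `introduce(Athlete("Maya"))` returns "Maya is an athlete", while `introduce(Volunteer("Sam"))` returns "Sam is a volunteer" — same function, different behavior per object.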


Transferring Linear Model Coefficients

Nina Zumel performs a swap:

A quick glance through the scikit-learn documentation on linear models, or the CRAN task view on Mixed, Multilevel, and Hierarchical Models in R, reveals a number of different procedures for fitting models with linear structure. Each of these procedures meets different needs and constraints, and some of them can be computationally intensive. But in the end, they all have the same underlying structure: the outcome is modelled as a linear combination of input features.

But the existence of so many different algorithms, and their associated software, can obscure the fact that just because two models were fit differently, they don’t have to be run differently. The fitting implementation and the deployment implementation can be distinct. In this note, we’ll talk about transferring the coefficients of a linear model to a fresh model, without a full retraining.

I had a similar problem about 18 months ago, though much easier than the one Nina describes, as I did have access to the original data and simply needed to build a linear regression in Python that matched exactly the one they developed in R. Turns out that’s not as easy to do as you might think: the different languages have different default assumptions that make the results similar but not the same, and piecing all of this together took a bit of sleuthing.
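The core idea is easy to sketch in plain NumPy: however the coefficients were fit, prediction is just a dot product plus an intercept, so the numbers can be carried into a "deployment" environment without any retraining. A minimal sketch on synthetic data (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training" environment: fit a linear model by ordinary least squares.
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.01, size=200)

A = np.column_stack([X, np.ones(len(X))])   # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
weights, intercept = coef[:-1], coef[-1]

# "Deployment" environment: no refit, just the transferred numbers.
def predict(X_new, weights, intercept):
    return X_new @ weights + intercept

preds = predict(X, weights, intercept)
```

The intercept column is exactly the kind of cross-language default I tripped over: R's `lm` adds one automatically, while in raw linear algebra you must append it yourself.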


An Introduction to Streamlit

I have started a new video series:

In this video, I talk about Streamlit, a great Python library for building data applications quickly. We discuss what data applications are, get an idea of how Streamlit compares to other code-first data visualization techniques, and start building a demo application. I also toss in a lengthy sidebar on Python virtual environments because of how important they are.

Streamlit certainly has its foibles—many of which I’ll cover in the series—but I like it a lot as a simple way of building data applications.
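If you would rather skim code than watch, a minimal Streamlit app looks something like this (my own toy example, not the demo from the video). Note that it must be launched with `streamlit run`, not plain `python`:

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo Data Application")

# Streamlit reruns the whole script on every widget interaction --
# that rerun model is the source of several of its foibles.
n = st.slider("Number of points", min_value=10, max_value=500, value=100)

df = pd.DataFrame({
    "x": np.arange(n),
    "y": np.random.default_rng(42).normal(size=n).cumsum(),
})

st.line_chart(df, x="x", y="y")
st.dataframe(df.head())
```

Installing Streamlit into a virtual environment first (`python -m venv .venv`, then `pip install streamlit`) is exactly why that sidebar on virtual environments is in the video.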


Automate the Power BI Incremental Refresh Policy via Semantic Link Labs

Gilbert Quevauvilliers needs to get rid of some data fast:

The scenario here is that quite often there is a requirement to keep data only from a specific start date, or to keep data for the last N years (starting from the first day in January).

Currently, in Power BI, this is not possible using the default incremental refresh settings. Typically, you must keep more data than is required.

It is best illustrated by using a working example.

Check out that scenario and how you can use the Semantic Link Labs Python library to resolve it.


Parquet Files in Pandas

Chris LaGreca works with Parquet files:

Apache Parquet has become one of the de facto standards in modern data architecture. This open source, columnar data format serves as the backbone of many high-powered analytics and machine learning pipelines, supported by many of the world's most sophisticated platforms and services. AWS, Azure, and Google Cloud all offer built-in support for Parquet, while big data tools like Hadoop, Spark, Hive, and Databricks natively support Parquet, allowing seamless data processing and analytics. Parquet is also foundational in data lakehouse formats like Delta Lake, Iceberg, and Hudi, where its features are further enhanced.

Parquet is efficient and has broad industry support. In this post, I will showcase a few simple techniques to demonstrate working with Parquet and leveraging its special features using Pandas.

Pandas does make this rather easy, as Chris shows.


Parallel Download in Oracle Object Storage

Brendan Tierney continues a series on Oracle Object Storage:

In previous posts, I’ve given example Python code (and functions) for processing files into and out of OCI Object and Bucket Storage. One of these previous posts includes code and a demonstration of uploading files to an OCI Bucket using the multiprocessing package in Python.

Building upon these previous examples, the code below will download a Bucket using parallel processing. Like my last example, this code is based on the example code I gave in an earlier post on functions within a Jupyter Notebook.

Click through for the code.
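Brendan's code handles the OCI specifics (clients, namespaces, buckets), so the sketch below shows only the parallel fan-out pattern, with a stubbed download function. `download_object` is a placeholder you would replace with a real OCI `get_object` call:

```python
from concurrent.futures import ThreadPoolExecutor

def download_object(object_name):
    """Placeholder for the real OCI call, roughly:
    object_storage.get_object(namespace, bucket_name, object_name),
    followed by writing response.data to a local file."""
    return f"downloaded {object_name}"

def download_bucket(object_names, max_workers=8):
    # Threads suit I/O-bound downloads; Brendan's post uses the
    # multiprocessing package instead, and the fan-out shape is the same.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_object, object_names))
```

`pool.map` preserves the input order of the object names, which keeps the results easy to match back to the bucket listing.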


Tips for Choosing a Classifier

I’ve wrapped up yet another series:

In this video, I wrap up the series on classification and provide some quick-and-dirty tips on when to use each of the classification algorithms we have discussed.

This was a series I really enjoyed. I’ve had a talk on the topic for a few years, but getting the opportunity to dig in deeper and spend a few hours on the topic was nice. It also helped me fill in some gaps in my understanding and fix a few long-standing bugs in my demo code, so it’s got that going for it as well.


Suspend and Resume Microsoft Fabric Capacity

Olivier Van Steenlandt saves some cash:

With only a limited budget for exploring and testing new tools, I had to figure out how to use my budget efficiently. Therefore, before making any decisions, I looked at the Microsoft Fabric pricing and possibilities.

If you want to take a look at the Microsoft Fabric pricing models, you can find an overview via the following link: Microsoft Fabric – Pricing | Microsoft Azure

To avoid any surprises and to be as cost-effective as possible, I created an easy Python script that I can use to pause and start my Microsoft Fabric capacity, or, better said, suspend and resume it.

I highly recommend this for any organization that does not need 24/7 uptime for Fabric capacity. If you run your system 12 hours a day instead of 24, it takes your F64 capacity from $8k a month to $4k.
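Olivier's script drives the Azure management REST API. As a sketch of the shape of that call, the helper below only assembles the suspend/resume endpoint; the `api-version` value is an assumption (check the current Azure REST documentation), and the actual POST with a bearer token is left out:

```python
API_VERSION = "2023-11-01"  # assumed; verify against current Azure REST docs

def capacity_action_url(subscription_id, resource_group, capacity_name, action):
    """Build the ARM URL for suspending or resuming a Fabric capacity.

    action is 'suspend' or 'resume'; POST the URL with a bearer token.
    """
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Fabric/capacities"
        f"/{capacity_name}/{action}"
        f"?api-version={API_VERSION}"
    )
```

Wiring a pair of these calls into a scheduler is how you get that 12-hours-a-day saving without anyone remembering to click pause.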
