Category: Python

Creating a Parquet File in Python

Ed Pollack has part one of a two-parter:

This article dives into the Apache Parquet file format, how it works, and how it can be used to export and import data directly to SQL Server, even when a data platform that supports Parquet files natively is unavailable to assist.

In the second part of this article, customizations and more advanced options will be highlighted, showing the flexibility of Python as a tool to solve analytic data movement challenges.

I like how Ed covers the Parquet file format: it’s not all that complicated, but understanding it gives you an idea of why so many operations on Parquet data can be so fast.
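As a rough illustration of where that speed comes from, here is a minimal plain-Python sketch of two ideas in the format: values are stored column by column, and each row group carries min/max statistics that let a reader skip groups that cannot match a filter. This is a toy model, not real Parquet I/O and not Ed's code; the function names `build_row_groups` and `sum_where_at_least` and the sample data are made up for the example.

```python
# Toy model of two Parquet ideas: column-wise storage and
# min/max statistics per row group. This does not read or write Parquet.

def build_row_groups(rows, columns, group_size):
    """Split rows into groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        data = {name: [row[i] for row in chunk] for i, name in enumerate(columns)}
        stats = {name: (min(values), max(values)) for name, values in data.items()}
        groups.append({"data": data, "stats": stats})
    return groups


def sum_where_at_least(groups, filter_col, threshold, sum_col):
    """Sum sum_col over rows where filter_col >= threshold.

    Returns the total and the number of row groups actually read.
    """
    total = 0
    groups_read = 0
    for group in groups:
        _, col_max = group["stats"][filter_col]
        if col_max < threshold:
            continue  # statistics show no row can match, so skip the group
        groups_read += 1
        filter_values = group["data"][filter_col]
        sum_values = group["data"][sum_col]  # only two columns are touched
        total += sum(s for f, s in zip(filter_values, sum_values) if f >= threshold)
    return total, groups_read


# Hypothetical sample data: (year, order_id, amount)
rows = [
    (2021, "order-0", 10), (2021, "order-1", 20),
    (2022, "order-2", 30), (2022, "order-3", 40),
    (2023, "order-4", 50), (2023, "order-5", 60),
]
groups = build_row_groups(rows, ["year", "order_id", "amount"], group_size=2)
total, groups_read = sum_where_at_least(groups, "year", 2023, "amount")
print(total, groups_read)
```

In this sketch the query reads one of three row groups and never looks at the `order_id` column, which is the general shape of the savings a columnar reader can get.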


E-Mailing Query Results in Snowflake

Kevin Wilkie gussies up an e-mail:

In our last post, we discussed the most basic way to send an email in Snowflake. It was pretty simple, straight text – nothing to really grab the attention of our readers – which we know is the way to craft an email, right?

To do this, we’re going to have some fun in Python. Yes, delve deep into your bag of Python tricks as we get up to some shenanigans with Snowflake and Python.

Read on for a procedure to e-mail the prior result set in HTML format.
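For a sense of the HTML half of that job, here is a minimal sketch that turns column names and rows into an HTML table string. It is not Kevin's procedure and does not touch Snowflake at all; the function name `rows_to_html_table` and the sample rows are assumptions for illustration, and actually sending the message is left out.

```python
from html import escape


def rows_to_html_table(columns, rows):
    """Render column names and rows as a simple HTML table string."""
    header = "".join("<th>%s</th>" % escape(str(name)) for name in columns)
    body_rows = []
    for row in rows:
        # Escape every value so characters like & and < do not break the markup
        cells = "".join("<td>%s</td>" % escape(str(value)) for value in row)
        body_rows.append("<tr>%s</tr>" % cells)
    return "<table><tr>%s</tr>%s</table>" % (header, "".join(body_rows))


# Hypothetical result set
html_body = rows_to_html_table(
    ["REGION", "SALES"],
    [("East", 1200), ("West & Central", 950)],
)
print(html_body)
```

Escaping each value matters more than it looks: query results regularly contain ampersands and angle brackets, and an unescaped one can mangle the whole message body.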


Debugging in Databricks

Chen Hirsh enables a debugger:

Do you know that feeling, when you write beautiful code and everything just works perfectly on the first try?

I don’t.

Every time I write code, it doesn’t work in the beginning, and I have to debug it, make changes, test it…

Databricks introduced a debugger you can use on a code cell, and I’ve wanted to try it for quite some time now. Well, I guess the time is now.

I’m having trouble seeing the utility of a debugger here. Notebooks are already set up for debugging: you can easily add or remove cells and the underlying session maintains state between cells.


Querying a Fabric KQL Database via REST API

Sandeep Pawar grabs some data:

I have previously explained how to query a KQL database in a notebook using the Kusto Spark connector, Kusto Python SDK, and KQLMagic. Now, let’s explore another method using the REST API. Although this is covered in the ADX documentation, it isn’t covered for Fabric (with an example), so I wanted to write a quick blog to show how you can query a table from an Eventhouse using a REST API.

Click through to see how you can do it. Sandeep’s code is in Python, but because this just hits a REST API rather than using a library, you could also use a tool like Postman.
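To show how little is involved on the client side, here is a minimal sketch that builds (but does not send) such a request using only the standard library. It assumes the request shape from the ADX REST documentation, a POST to `/v1/rest/query` with a JSON body containing `db` and `csl`, also applies to an Eventhouse query URI; the URI, database, table, and token below are placeholders, and this is not Sandeep's code.

```python
import json
import urllib.request


def build_kusto_query_request(query_uri, database, kql, token):
    """Build (but do not send) a POST request for a KQL query."""
    payload = json.dumps({"db": database, "csl": kql}).encode("utf-8")
    return urllib.request.Request(
        url=query_uri.rstrip("/") + "/v1/rest/query",
        data=payload,
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json; charset=utf-8",
        },
    )


request = build_kusto_query_request(
    query_uri="https://example-eventhouse.invalid",  # placeholder, not a real endpoint
    database="MyDatabase",                            # hypothetical database name
    kql="MyTable | take 10",                          # hypothetical table name
    token="<access-token>",                           # acquire a real token separately
)
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request), followed by
# json.load() on the response; that step is omitted here.
```

The same method, URL, headers, and body can be pasted into Postman or any other HTTP client.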


Analyzing Delta Table Measures in Microsoft Fabric

Sandeep Pawar has a script for us:

I have been sitting on this code for a long time. I shared the first version in one of my blogs on Direct Lake last year. I have been making updates to it since then as needed. I waited for the lakehouse schema to become available and then forgot to blog about it. Yesterday, someone reached out asking if the above could be used for warehouse delta tables in Fabric, so here you go. It’s 250+ lines so let me just explain what’s going on here:

Read on for the explanation, the script itself, a demonstration, and several additional notes.


Charting Microsoft Fabric Workspace Activity

Sandeep Pawar creates a chart:

Semantic Link Labs v0.8.3 has a list_activities method to get the list of all activities in your Fabric tenant. It uses the same Power BI Admin - Get Activity Events API, but this API now also includes Fabric activities. Note that this is an Admin API, so you need to be a Fabric administrator. Check the API details.

To answer the above question, I will use admin.list_activity_events, loop over the last 30 days, and plot the results by Fabric item type in my personal tenant:

Click through for that code. Though if you’re going to do something similar in your environment, I recommend not using a line chart for this visual, as line charts indicate a flow over time and this is definitely point-in-time categorical data. A bar chart or dot plot would be better in that case.
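As a small sketch of the categorical view, the snippet below counts events per item type and renders the counts as horizontal text bars. It uses only the standard library rather than a plotting package; the `item_type` field name and the sample events are made up and are not the shape of the actual API output.

```python
from collections import Counter


def count_by_category(events, key):
    """Count events per category, most frequent first."""
    return Counter(event[key] for event in events).most_common()


def text_bar_chart(counts, width=30):
    """Render (label, count) pairs as horizontal text bars."""
    if not counts:
        return ""
    largest = max(count for _, count in counts)
    label_width = max(len(label) for label, _ in counts)
    lines = []
    for label, count in counts:
        # Scale each bar relative to the largest category
        bar = "#" * max(1, round(width * count / largest))
        lines.append("%s  %s %d" % (label.ljust(label_width), bar, count))
    return "\n".join(lines)


# Hypothetical activity events
events = [
    {"item_type": "Notebook"}, {"item_type": "Lakehouse"},
    {"item_type": "Notebook"}, {"item_type": "Report"},
    {"item_type": "Notebook"}, {"item_type": "Lakehouse"},
]
counts = count_by_category(events, "item_type")
print(text_bar_chart(counts))
```

Each category gets its own bar with no implied connection between neighbors, which is the property a line chart lacks for this kind of data.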


A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.


Fabric List Connections API in Semantic Link Labs

Sandeep Pawar has an update for us:

In case you missed it, the List Connections Admin API is now live in Fabric. It was shipped in Semantic Link Labs v0.7.4 a few weeks ago, but at the time of the release it was still private. This API returns all the connections set up in the tenant and requires admin privileges. I still can’t find documentation on it, so wait for the official details. Note that this API is different from the item list connections API, which lists connections used by an item.

Read on to see what you can get from it.


Lexing DAX with PyDAX

Sandeep Pawar reviews a DAX lexer:

The power of open-source and GenAI. Klaus Jürgen Folz recently open-sourced the PyDAX library, which parses DAX expressions to extract or remove comments, and identify referenced columns and measures. I used that library to create some demos for myself and then shared the notebook along with instructions with Replit agents to build an app for me. 15 minutes & 3 prompts later I had a fully functional app. Give it a try: https://daxparser.replit.app/

Read on to learn more, including why I referred to PyDAX as a “lexer” and a few more notes of relevance.


Map and FlatMap in PySpark

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated the need for developers to use the original Hadoop-like map-reduce approach to writing code in Spark. Even so, I do think it’s useful to know how it all works.
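If you just want the gist of the difference, here is a plain-Python analogy that runs without Spark. The helpers `map_like` and `flat_map_like` are made-up names that mimic what `rdd.map(func)` and `rdd.flatMap(func)` do to the elements of an RDD; they are a sketch of the semantics, not PySpark code and not Vipul's examples.

```python
from itertools import chain


def map_like(func, items):
    """One output element per input element, like rdd.map(func)."""
    return [func(item) for item in items]


def flat_map_like(func, items):
    """Apply func, then flatten one level, like rdd.flatMap(func)."""
    return list(chain.from_iterable(func(item) for item in items))


lines = ["hello world", "map and flatMap"]
mapped = map_like(str.split, lines)          # a list of word lists, one per line
flattened = flat_map_like(str.split, lines)  # a single flat list of words
print(mapped)
print(flattened)
```

With `map`, two input lines give two output elements (each a list of words); with `flatMap`, the same two lines give five output elements, one per word.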
