
Category: Python

Debugging in Databricks

Chen Hirsh enables a debugger:

Do you know that feeling, when you write beautiful code and everything just works perfectly on the first try?

I don’t.

Every time I write code, it doesn’t work at first, and I have to debug it, make changes, test it…

Databricks introduced a debugger you can use on a code cell, and I’ve wanted to try it for quite some time now. Well, I guess the time is now.

I’m having trouble finding the utility of a debugger here. Notebooks are already set up for debugging: you can easily add or remove cells, and the underlying session maintains state between cells.
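As a quick sketch of that cell-by-cell workflow (hypothetical data, not from either post), state built in one cell stays available for inspection in the next; the breakpoint() line assumes a runtime with standard pdb support:

```python
# Cell 1: build up some state; the notebook session keeps it alive.
orders = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": -5.0}]

# Cell 2: something looks wrong downstream, so inspect the intermediate
# state in a new cell rather than re-running the whole job.
bad_orders = [o for o in orders if o["amount"] < 0]
print(bad_orders)  # [{'id': 2, 'amount': -5.0}]

# Cell 3: if you do want a stepping debugger, Python's standard
# breakpoint() hook drops into pdb inside a cell.
def validate(order):
    breakpoint()  # pauses here with `orders` still in scope
    return order["amount"] >= 0
```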


Querying a Fabric KQL Database via REST API

Sandeep Pawar grabs some data:

I have previously explained how to query a KQL database in a notebook using the Kusto Spark connector, Kusto Python SDK, and KQLMagic. Now, let’s explore another method using the REST API. Although this is covered in the ADX documentation, it isn’t covered (with an example) in the Fabric documentation, so I wanted to write a quick blog to show how you can query a table from an Eventhouse using the REST API.

Click through to see how you can do it. Sandeep’s code is in Python, but because this just hits a REST API rather than using a library, you could also use a tool like Postman.
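For a flavor of what that REST call looks like, here’s a minimal Python sketch (not Sandeep’s exact code; the cluster URI, database, and table names are placeholders, and it assumes azure-identity for the token):

```python
# Minimal sketch: query an Eventhouse table through the Kusto REST API.
import requests
from azure.identity import DefaultAzureCredential

cluster = "https://<your-eventhouse>.kusto.fabric.microsoft.com"  # query URI from Fabric
database = "MyKqlDatabase"

# Acquire an Entra ID token scoped to the cluster.
token = DefaultAzureCredential().get_token(f"{cluster}/.default").token

response = requests.post(
    f"{cluster}/v1/rest/query",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    json={"db": database, "csl": "MyTable | take 10"},
)
response.raise_for_status()

# The v1 response holds a list of tables; the first contains the rows.
tables = response.json()["Tables"]
print(tables[0]["Rows"])
```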


Analyzing Delta Table Measures in Microsoft Fabric

Sandeep Pawar has a script for us:

I have been sitting on this code for a long time. I shared the first version in one of my blogs on Direct Lake last year. I have been making updates to it since then as needed. I waited for the lakehouse schema to become available and then forgot to blog about it. Yesterday, someone reached out asking if the above could be used for warehouse delta tables in Fabric, so here you go. It’s 250+ lines so let me just explain what’s going on here:

Read on for the explanation, the script itself, a demonstration, and several additional notes.
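For a taste of the kind of per-table metadata such a script gathers, Delta Lake’s detail() call surfaces the basics from a notebook (the table name below is a placeholder, and spark is the session a Fabric or Databricks notebook already provides):

```python
# Sketch: pull size and file-count metadata for one Delta table.
from delta.tables import DeltaTable

detail = DeltaTable.forName(spark, "lakehouse.my_table").detail()

# detail() returns a one-row DataFrame with columns such as
# numFiles, sizeInBytes, partitionColumns, and lastModified.
detail.select("numFiles", "sizeInBytes", "partitionColumns").show()
```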


Charting Microsoft Fabric Workspace Activity

Sandeep Pawar creates a chart:

Semantic Link Labs v0.8.3 has a list_activities method to get the list of all activities in your Fabric tenant. It uses the same Power BI Admin - Get Activity Events API, but this API now also includes Fabric activities. Note that this is an Admin API, so you need to be a Fabric administrator. Check the API details.

To answer the above question, I will use admin.list_activity_events, loop over the last 30 days, and plot the results by Fabric item type in my personal tenant:

Click through for that code. Though if you’re going to do something similar in your environment, I recommend not using a line chart for this visual, as line charts indicate a flow over time and this is definitely point-in-time categorical data. A bar chart or dot plot would be better in that case.
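A rough sketch of that loop plus the bar-chart alternative might look like the following; it assumes Semantic Link Labs is installed and that you hold admin rights, and the exact list_activity_events signature and column names can vary by version:

```python
# Sketch: pull 30 days of activity events and plot counts as a bar chart.
from datetime import datetime, timedelta, timezone
import pandas as pd
import sempy_labs.admin as admin

frames = []
for d in range(30):
    day = datetime.now(timezone.utc).date() - timedelta(days=d + 1)
    # The underlying Admin API only accepts a window within a single day.
    frames.append(admin.list_activity_events(
        start_time=f"{day}T00:00:00",
        end_time=f"{day}T23:59:59",
    ))

events = pd.concat(frames, ignore_index=True)

# Inspect events.columns for the exact names in your version, then
# aggregate by activity/item type and plot horizontal bars.
events["Activity"].value_counts().sort_values().plot.barh()
```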


A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.
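Not from the article itself, but a minimal round trip shows the core pattern such guides build on: register a DataFrame as a view, then query it with SQL:

```python
# A minimal PySpark SQL round trip (standalone; in Fabric or Databricks
# a SparkSession already exists as `spark`).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 41), ("Cara", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view so SQL can see it.
df.createOrReplaceTempView("people")

# Query it with plain SQL; the result is itself a DataFrame.
spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name").show()
```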


Fabric List Connections API in Semantic Link Labs

Sandeep Pawar has an update for us:

In case you missed it, the List Connections Admin API is now live in Fabric. It was shipped in Semantic Link Labs v0.7.4 a few weeks ago, but at the time of the release it was still private. This API returns all the connections set up in the tenant and requires admin privileges. I still can’t find documentation on it, so wait for the official details. Note that this API is different from the item-level list connections API, which lists the connections used by a single item.

Read on to see what you can get from it.
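Since the documentation wasn’t out yet, any example here is necessarily a guess at the surface area. A sketch using sempy’s generic REST client, with the endpoint path assumed from the pattern other Fabric admin APIs follow:

```python
# Sketch: call the admin connections endpoint from a Fabric notebook.
import sempy.fabric as fabric

client = fabric.FabricRestClient()
response = client.get("/v1/admin/connections")  # assumed path
response.raise_for_status()

# Field names are assumptions until the official docs land.
for conn in response.json().get("value", []):
    print(conn.get("id"), conn.get("displayName"))
```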


Lexing DAX with PyDAX

Sandeep Pawar reviews a DAX lexer:

The power of open source and GenAI. Klaus Jürgen Folz recently open-sourced the PyDAX library, which parses DAX expressions to extract or remove comments and to identify referenced columns and measures. I used that library to create some demos for myself and then shared the notebook, along with instructions, with Replit agents to build an app for me. 15 minutes & 3 prompts later, I had a fully functional app. Give it a try: https://daxparser.replit.app/

Read on to learn more, including why I referred to PyDAX as a “lexer” and a few more notes of relevance.


Map and FlatMap in PySpark

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated having developers use the original Hadoop-like map-reduce approach to writing code in Spark. Even so, I do think it’s useful to know how it all works.
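For reference, a minimal side-by-side of the two transformations (standalone; in Databricks or Fabric the SparkSession already exists as spark):

```python
# map() vs. flatMap() on an RDD: map keeps one output element per input,
# while flatMap flattens each returned iterable into individual elements.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
rdd = spark.sparkContext.parallelize(["hello world", "big data"])

print(rdd.map(lambda s: s.split()).collect())
# [['hello', 'world'], ['big', 'data']]  -- one list per input string

print(rdd.flatMap(lambda s: s.split()).collect())
# ['hello', 'world', 'big', 'data']      -- flattened into words
```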


A Survey of Predictive Analytics Techniques

Akmal Chaudhri tries a bunch of things:

In this short article, we’ll explore loan approvals using a variety of tools and techniques. We’ll begin by analyzing loan data and applying Logistic Regression to predict loan outcomes. Building on this, we’ll integrate BERT for Natural Language Processing to enhance prediction accuracy. To interpret the predictions, we’ll use SHAP and LIME explanation frameworks, providing insights into feature importance and model behavior. Finally, we’ll explore the potential of Natural Language Processing through LangChain to automate loan predictions, using the power of conversational AI.

Click through for the notebook, as well as an overview of what the notebook includes. I don’t particularly like word clouds as the “solution” in the BERT example, though without real data to perform any sort of NLP, there’s not much you can meaningfully do.
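To make the logistic regression step concrete without the article’s dataset, here’s a small sketch on synthetic, loan-shaped data (every column and coefficient below is made up for illustration):

```python
# Sketch: logistic regression on synthetic loan-approval data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1_000
income = rng.normal(60_000, 15_000, n)
debt_ratio = rng.uniform(0, 1, n)

# Synthetic labels: approval odds rise with income, fall with debt ratio.
logits = 0.00005 * (income - 60_000) - 3.0 * (debt_ratio - 0.5)
approved = (rng.uniform(0, 1, n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X = np.column_stack([income, debt_ratio])
X_train, X_test, y_train, y_test = train_test_split(X, approved, random_state=0)

# Scale features so the solver converges cleanly, then fit and score.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```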


A Primer on Pandas

Rajendra Gupta talks about Pandas:

Have you heard about Pandas in Python? It is a widely used open-source library for analyzing and manipulating data in the Python programming language. Let’s explore it with use cases and examples.

Click through for an overview of the library. Pandas isn’t the quickest performer as your data sets get large, but for ease of use on moderately-sized datasets (up to hundreds of thousands of rows, or maybe millions if you manage things well), it does a good job.
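As a quick taste of the workflow the overview covers, a few chained pandas calls go a long way (hypothetical data):

```python
# A small taste of the pandas workflow on an in-memory dataset.
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East", "South"],
    "sales": [120.0, 95.5, 210.0, 80.0],
})

# Filter, aggregate, and sort in a few chained calls.
summary = (
    df[df["sales"] > 90]
    .groupby("region", as_index=False)["sales"]
    .sum()
    .sort_values("sales", ascending=False)
)
print(summary)
```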
