Python – Page 22 – Curated SQL

Package Management in Python

Published 2023-10-09 by Kevin Feasel

Georgia Atkinson wraps things up with a bow:

Python is a general purpose, high level language which, thanks to its simplicity and versatility, has become very popular, especially within the data science community. The extensive Python community has developed and contributed thousands of libraries and packages over the years in a plethora of different disciplines to aid developers with their applications. Managing these packages can be a challenging task without the correct tools. That’s where Python package managers come in. In this blog post we will explore what a package manager is and why they are important. We will then cover some popular examples, including how to use them, how to install them and the pros and cons of each.

Whilst we will briefly touch on virtual environments in places, we will explore these in more depth in an upcoming post.

Read on for a primer on three options, including how they compare to one another for CI/CD purposes.

Comments closed

Visualizing Kusto Graphs with Plotly and Python

Published 2023-10-05 by Kevin Feasel

Henning Rauch creates some plots:

Graphs are a powerful way to model and analyse complex relationships between entities, such as cybersecurity incidents, network traffic, social networks, and more. Kusto, the query and analytics engine of Azure Data Explorer, Microsoft Fabric Real-Time Analytics and many more recently introduced a new feature that enables users to contextualize their data using graphs. In this blog post, we will show you how to use graph semantics to create and explore graph data in Kusto, and how to visualize it using Plotly, a popular library for interactive data visualization in Python.

Graph semantics are a set of operators that allow users to work with graph data in Kusto, without the need to use a separate graph database or framework.

Click through for the KQL you’ll need, as well as how to display that in Plotly.

Comments closed

Reshaping Records using cdata

Published 2023-10-04 by Kevin Feasel

John Mount takes us through a common data wrangling problem:

In many data science projects we have the data, but it “is in the wrong format.” Fortunately re-formatting or reshaping data is a solved problem, with many different available tools.

For this note, I would like to show how to reshape data using the data algebra‘s cdata data reshaping tool. This should give you familiarity with a tool to use on your own data.

Click through for an example in Python. Mount and Nina Zumel also have an R package for cdata.

Comments closed

ggplot2 in Python Notebooks

Published 2023-09-26 by Kevin Feasel

John Mount runs R in Python with rpy2:

For an article on A/B testing that I am preparing, I asked my partner Dr. Nina Zumel if she could do me a favor and write some code to produce the diagrams. She prepared an excellent parameterized diagram generator. However being the author of the book Practical Data Science with R, she built it in R using ggplot2. This would be great, except the A/B testing article is being developed in Python, as it targets programmers familiar with Python.

As the production of the diagrams is not part of the proposed article, I decided to use the rpy2 package to integrate the R diagrams directly into the new worksheet. Alternatively, I could translate her code into Python using one of: Seaborn objects, plotnine, ggpy, or others. The large number of options is evidence of how influential Leland Wilkinson’s grammar of graphics (gg) is.

Click through to see how you can execute R code within the context of Python, similar to how you can use the reticulate package to execute Python code in the context of R.

Comments closed

Generating Reproducible Reports with Jupyter and Quarto

Published 2023-09-25 by Kevin Feasel

Parisa Gregg and Myles Mitchell don’t need to copy and paste for their TPS reports:

Quarto is a free-to-use, open-source software based on Pandoc that enables users to convert plain text files into a range of formats, including PDF, HTML and powerpoint presentations. These documents can contain a mixture of narrative text, Python code, and figures that are dynamically generated by the embedded code.

This has many use-cases:

Your company may have a weekly board meeting to go over the latest sales figures. By having a Quarto presentation that pulls in the latest company sales data, you can regenerate the presentation slides each week at the click of a button.

As a researcher you may be preparing a report for publication. By having the code that generates data tables and figures embedded within the report, regenerating the draft as the experimental data floods in is a breeze!

Read on for a fun example of how you could automated a research-driven report.

Comments closed

ML with Keras and TensorFlow over Streaming Kafka Data

Published 2023-09-18 by Kevin Feasel

Paul Brebner gives us a streaming scenario for model training:

One of the goals of incremental learning is to train a model continuously from streaming data. Incremental learning from streaming data means you don’t need all the data in memory at once, and the model is as up-to-date as possible, which can matter for real-time use cases. The third driver for incremental learning that I mentioned in the previous blog is when there is concept drift in the data itself—but we’ll ignore this aspect for the time being.

In the last blog we demonstrated batch training with TensorFlow, and mentioned that TensorFlow, being a neural network framework, has the potential for incremental learning—just like animals and people do. In this blog, we will set ourselves the task of using TensorFlow to demonstrate incremental learning from the same static drone delivery data set of busy/not busy shops that we used in the last blog.

Read on to see the code, results, and warnings.

Comments closed

Where Git Repositories Store File Versions

Published 2023-09-18 by Kevin Feasel

Julia Evans digs into a folder:

Hello! I was talking to a friend about how git works today, and we got onto the topic – where does git store your files? We know that it’s in your .git directory, but where exactly in there are all the versions of your old files?

For example, this blog is in a git repository, and it contains a file called content/post/2019-06-28-brag-doc.markdown. Where is that in my .git folder? And where are the old versions of that file? Let’s investigate by writing some very short Python programs.

Read on to learn how you can parse it all out. And this is also reason number 3 why you don’t want to commit a large file to Git: even if you delete that file later, the contents will live in the .git folder forever, or at least until you take some manual action to excise it from Git’s history.

Comments closed

Creating a Python Function in Snowflake

Published 2023-09-14 by Kevin Feasel

Kevin Wilkie has a helpful function:

I use a Python script as a function inside of my Snowflake instance to make my life easier and today I’d like to share it with you.

Click through for that script, as well as a few notes on the topic.

Comments closed

Generating Tables from Files in Microsoft Fabric via Notebook

Published 2023-09-14 by Kevin Feasel

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained on the article about lakehouse and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

Let’s make a step-by-step on this blog about how to use notebooks and table partitioning.

Click through to see how it all works.

Comments closed

Querying the Power BI REST API from Fabric Spark

Published 2023-09-01 by Kevin Feasel

Gerhard Brueckl makes the call:

Microsoft Fabric has a lot of different components which usually work very well together. However, even though Power BI is a fundamental part of Fabric, there is not really a tight integration between Data Engineering components and Power BI. In this blog post I will show you an easy and reusable way to query the Power BI REST API via Fabric SQL in a very straight forward way. The extracted data can then be stored in the data lake e.g. to create a history of your dataset refreshes, the state of your workspaces or any other information that is provided by the REST API.

Click through for a list of operations, followed by the code you’ll need to pull this off.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Python