Python – Page 32 – Curated SQL

Data Quality Checks in Power BI

Published 2022-07-28 by Kevin Feasel

Picture this, you have a report in Power BI that someone passes off to you for data quality checks. There are a few ways to make sure your measures match what is in the source data system, but for this demo we are going to use python and excel to perform our data quality checks in one batch. In order to do that, we are going to build a python script that can run Power BI REST APIs, connect to a SQL Server, and connect to Excel to grab the formulas and to push back the quality check into Excel for final review. To find a sample Excel and the final python script, please refer to my GitHub.

Check out the GitHub repo as well as Kristyna’s very detailed walkthrough.

Comments closed

Python UDFs in Databricks SQL

Published 2022-07-25 by Kevin Feasel

Martin Grund, et al, announce a new preview feature in Databricks:\

To define the Python UDF, all you have to do is a CREATE FUNCTION SQL statement. This statement defines a function name, input parameters and types, specifies the language as PYTHON, and provides the function body between $$.
The function body of a Python UDF in Databricks SQL is equivalent to a regular Python function, with the UDF itself returning the computation’s final value. Dependencies from the Python standard library and Databricks Runtime 10.4, such as the json package in the above example, can be imported and used in your code. You can also define nested functions inside your UDF to encapsulate code to build or reuse complex logic.

I think my biggest concern here would be performance, though I say that without having used the feature.

Comments closed

Pre-Processing Data Explorer Data with Spark

Published 2022-07-22 by Kevin Feasel

Hauke Mallow does some data engineering:

We often see customer scenarios where historical data has to be migrated to Azure Data Explorer (ADX). Although ADX has very powerful data-transformation capabilities via update policies, sometimes more or less complex data engineering tasks must be done upfront. This happens if the original data structure is too complex or just single data elements being too big, hitting data explorer limits of dynamic columns of 1 MB or maximum ingest file-size of 1 GB for uncompressed data (see also Comparing ingestion methods and tools) .
Let’s think about an Industrial Internet-of-Things (IIoT) use-case where you get data from several production lines. In the production line several devices read humidity, pressure, etc. The following example shows a scenario where a one-to-many relationship is implemented within an array. With this you might get very large columns (with millions of device readings per production line) that might exceed the limit of 1 MB in Azure Data Explorer for dynamic columns. In this case you need to do some pre-processing.

Click through to see how you can do this with an Azure Synapse Analytics Spark pool prior to ingesting it with a Data Explorer pool.

Comments closed

Customer Segmentation via Databricks Solution Accelerator

Published 2022-06-29 by Kevin Feasel

Gavita Regunath discovers customer segments in a dataset:

We will be using the German Credit dataset, a publicly available dataset provided by Dr. Hans Hofmann of the University of Hamburg. The German Credit dataset contains features describing 1000 loan applicants who have taken credit from the bank. Using this dataset, our aim will be to understand the following “How should the bank personalise its products for its customers?”.

Click through to see an example of clustering to generate customer segments.

Comments closed

PHI De-Identification in Databricks with NLP

Published 2022-06-24 by Kevin Feasel

Amir Kermany, et al, share a set of notebooks:

John Snow Labs, the leader in Healthcare natural language processing (NLP), and Databricks are working together to help organizations process and analyze their text data at scale with a series of Solution Accelerator notebook templates for common NLP use cases. You can learn more about our partnership in our previous blog, Applying Natural Language Processing to Health Text at Scale.
To help organizations automate the removal of sensitive patient information, we built a joint Solution Accelerator for PHI removal that builds on top of the Databricks Lakehouse for Healthcare and Life Sciences. John Snow Labs provides two commercial extensions on top of the open-source Spark NLP library — both of which are useful for de-identification and anonymization tasks — that are used in this Accelerator:

This is a really interesting scenario.

Comments closed

Saving and Loading a Keras Model

Published 2022-06-23 by Kevin Feasel

Jason Brownlee made it to a savepoint in time:

Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk.
In this post, you will discover how you can save your Keras models to file and load them up again to make predictions.
After reading this tutorial you will know:
– How to save model weights and model architecture in separate files.
– How to save model architecture in both YAML and JSON format.
– How to save model weights and architecture into a single file for later use.

Read on for an updated step-by-step tutorial.

Comments closed

Comparing Data Analysis in Java and Python

Published 2022-06-16 by Kevin Feasel

Manu Barriola does some data analysis in a pair of quite different languages:

Python is a dynamically typed language, very straightforward to work with, and is certainly the language of choice to do complex computations if we don’t have to worry about intricate program flows. It provides excellent libraries (Pandas, NumPy, Matplotlib, ScyPy, PyTorch, TensorFlow, etc.) to support logical, mathematical, and scientific operations on data structures or arrays.
Java is a very robust language, strongly typed, and therefore has more stringent syntactic rules that make it less prone to programmatic errors. Like Python provides plenty of libraries to work with data structures, linear algebra, machine learning, and data processing (ND4J, Mahout, Spark, Deeplearning4J, etc.).
In this article, we’re going to focus on a narrow study of how to do simple data analysis of large amounts of tabular data and compute some statistics using Java and Python. We’ll see different techniques on how to do the data analysis on each platform, compare how they scale, and the possibilities to apply parallel computing to improve their performance.

Read on to see how the two compare. Note that this is base Java and Python+Pandas, not Spark/PySpark, Koalas, etc.

Comments closed

Normalization Layers in Deep Learning Models

Published 2022-06-16 by Kevin Feasel

Zhe Ming Chng explains why data normalization matters in data science:

You’ve probably been told to standardize or normalize inputs to your model to improve performance. But what is normalization and how can we implement it easily in our deep learning models to improve performance? Normalizing our inputs aims to create a set of features that are on the same scale as each other, which we’ll explore more in this article.
Also, thinking about it, in neural networks, the output of each layer serves as the inputs into the next layer, so a natural question to ask is: If normalizing inputs to the model helps improve model performance, does standardizing the inputs into each layer help to improve model performance too?

Click through for the tutorial.

Comments closed

Generating an Expression Variable for Joins with PySpark

Published 2022-06-15 by Kevin Feasel

Unmesha Sreeveni uses a variable to effect a join in PySpark:

Lets see how to join 2 table with a parameterized on condition in PySpark
Eg: I have 2 dataframes A and B and I want to join them with id,inv_no,item and subitem

Click through to see how. It turns out to be pretty straightforward.

Comments closed

Ingesting Event Hub Telemetry Data with PySpark Streaming

Published 2022-06-06 by Kevin Feasel

Charles Chukwudozie shows how to read from Event Hubs in Databricks with Python:

Ingesting, storing, and processing millions of telemetry data from a plethora of remote IoT devices and Sensors has become common place. One of the primary Cloud services used to process streaming telemetry events at scale is Azure Event Hub.
Most documented implementations of Azure Databricks Ingestion from Azure Event Hub Data are based on Scala.
So, in this post, I outline how to use PySpark on Azure Databricks to ingest and process telemetry data from an Azure Event Hub instance configured without Event Capture.

Click through for the process.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Python