Python – Page 27 – Curated SQL

Saving and Loading a Keras Model

Published 2022-06-23 by Kevin Feasel

Jason Brownlee made it to a savepoint in time:

Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk.
In this post, you will discover how you can save your Keras models to file and load them up again to make predictions.
After reading this tutorial you will know:
– How to save model weights and model architecture in separate files.
– How to save model architecture in both YAML and JSON format.
– How to save model weights and architecture into a single file for later use.

Read on for an updated step-by-step tutorial.

Normalization Layers in Deep Learning Models

Published 2022-06-16 by Kevin Feasel

Zhe Ming Chng explains why data normalization matters in data science:

You’ve probably been told to standardize or normalize inputs to your model to improve performance. But what is normalization and how can we implement it easily in our deep learning models to improve performance? Normalizing our inputs aims to create a set of features that are on the same scale as each other, which we’ll explore more in this article.
Also, thinking about it, in neural networks, the output of each layer serves as the inputs into the next layer, so a natural question to ask is: If normalizing inputs to the model helps improve model performance, does standardizing the inputs into each layer help to improve model performance too?

Click through for the tutorial.
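To make "on the same scale" concrete, here is a minimal sketch in plain Python rather than the Keras layers the tutorial covers. The feature names and values are made up for illustration, and the two helper functions are my own, not something from the article.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no spread to rescale.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def min_max_scale(column):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical features on very different scales.
ages = [20.0, 30.0, 40.0]
incomes = [20_000.0, 50_000.0, 80_000.0]

scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)

print(scaled_ages)
print(scaled_incomes)
print(min_max_scale(ages))
```

After standardizing, both columns sit on the same scale even though the raw incomes are a thousand times larger than the raw ages. Normalization layers apply the same idea to the outputs of intermediate layers.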

Comparing Data Analysis in Java and Python

Published 2022-06-16 by Kevin Feasel

Manu Barriola does some data analysis in a pair of quite different languages:

Python is a dynamically typed language, very straightforward to work with, and is certainly the language of choice to do complex computations if we don’t have to worry about intricate program flows. It provides excellent libraries (Pandas, NumPy, Matplotlib, SciPy, PyTorch, TensorFlow, etc.) to support logical, mathematical, and scientific operations on data structures or arrays.
Java is a very robust language, strongly typed, and therefore has more stringent syntactic rules that make it less prone to programmatic errors. Like Python, it provides plenty of libraries to work with data structures, linear algebra, machine learning, and data processing (ND4J, Mahout, Spark, Deeplearning4J, etc.).
In this article, we’re going to focus on a narrow study of how to do simple data analysis of large amounts of tabular data and compute some statistics using Java and Python. We’ll see different techniques on how to do the data analysis on each platform, compare how they scale, and the possibilities to apply parallel computing to improve their performance.

Read on to see how the two compare. Note that this is base Java and Python+Pandas, not Spark/PySpark, Koalas, etc.

Generating an Expression Variable for Joins with PySpark

Published 2022-06-15 by Kevin Feasel

Unmesha Sreeveni uses a variable to effect a join in PySpark:

Let’s see how to join two tables with a parameterized ON condition in PySpark.
E.g., I have two DataFrames, A and B, and I want to join them on id, inv_no, item, and subitem.

Click through to see how. It turns out to be pretty straightforward.
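As a rough sketch of the idea, and not necessarily the approach in the linked post, one way to parameterize the condition is to build it as a SQL expression string from a list of column names. The helper name build_join_condition and the aliases A and B are my own choices for illustration; the PySpark call is shown only in a comment because it needs a Spark session to run.

```python
def build_join_condition(columns, left="A", right="B"):
    """Build a SQL-style equality condition across the given join columns."""
    if not columns:
        raise ValueError("at least one join column is required")
    return " AND ".join(f"{left}.{col} = {right}.{col}" for col in columns)


# The join columns named in the excerpt above.
join_columns = ["id", "inv_no", "item", "subitem"]
condition = build_join_condition(join_columns)
print(condition)

# With PySpark available, the string could be handed to expr() on aliased
# DataFrames (df_a and df_b are hypothetical and not defined here):
#   from pyspark.sql import functions as F
#   joined = df_a.alias("A").join(df_b.alias("B"), F.expr(condition), "inner")
```

Because the condition is just a variable, the list of join columns can come from configuration rather than being hard-coded into the join.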

Ingesting Event Hub Telemetry Data with PySpark Streaming

Published 2022-06-06 by Kevin Feasel

Charles Chukwudozie shows how to read from Event Hubs in Databricks with Python:

Ingesting, storing, and processing millions of telemetry data points from a plethora of remote IoT devices and sensors has become commonplace. One of the primary cloud services used to process streaming telemetry events at scale is Azure Event Hub.
Most documented implementations of Azure Databricks Ingestion from Azure Event Hub Data are based on Scala.
So, in this post, I outline how to use PySpark on Azure Databricks to ingest and process telemetry data from an Azure Event Hub instance configured without Event Capture.

Click through for the process.

Creating Reproducible Examples with CI

Published 2022-06-03 by Kevin Feasel

Colin Gillespie and Jack Walton tackle a common training problem:

As the number of courses we offer increased, so did the maintenance burden of our associated training materials (lecture notes, slides, exercises, and more). To ease this burden, and to assist in ensuring that our training materials build consistently, we developed an R package called {jrNotes2}. Amongst other things, this package ensures that all courses:
– have identical “template files”: .gitlab-ci.yml, .gitignore, Makefiles, index.Rmd, …;
– have the same directory structure, and
– pass a set of quality-assurance checks.

This is smart, but read on to see why it’s still a challenge. This is especially true in the R and Python worlds, where breaking changes seem to be so common.

Monitoring Streaming Queries in PySpark

Published 2022-05-31 by Kevin Feasel

Hyukjin Kwon, et al., lay out some monitoring advice:

Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. However, monitoring streaming data workloads is challenging because the data is continuously processed as it arrives. Because of this always-on nature of stream processing, it is harder to troubleshoot problems during development and production without real-time metrics, alerting and dashboarding.

Read on to see how you can use the Observable API for alerting in PySpark—previously, it had been a Scala-only API.

PyODBC vs C# ODBC Performance Differences

Published 2022-04-25 by Kevin Feasel

Jose Manuel Jurado Diaz explains a performance difference:

A customer asked today why, when using ODBC Driver 17 for SQL Server in Python with pyodbc, there is a slight difference in time taken compared with C# System.Data.Odbc. Below, I would like to share my lessons learned about it.

Read on for Jose’s explanation. My short version is, it seems particularly important when using the Python ODBC driver to write the exact query you want rather than a SELECT * or query which returns rows/columns you don’t need.
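To illustrate that short version, here is a small sketch that uses the standard library's sqlite3 module as a stand-in for pyodbc, since both follow the Python DB-API and both use ? parameter markers. The orders table and its contents are made up for the example, and nothing here measures ODBC driver performance; it only shows the difference in how much data the client has to fetch.

```python
import sqlite3

# In-memory database with a hypothetical table that has one wide column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, total REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, total, notes) VALUES (?, ?, ?)",
    [
        ("alice", 10.0, "x" * 1000),
        ("bob", 25.5, "y" * 1000),
        ("alice", 4.5, "z" * 1000),
    ],
)

# SELECT * pulls every row and every column, including the wide notes text.
wide_rows = conn.execute("SELECT * FROM orders ORDER BY id").fetchall()

# The targeted query returns only the rows and columns the caller needs.
narrow_rows = conn.execute(
    "SELECT id, total FROM orders WHERE customer = ? ORDER BY id",
    ("alice",),
).fetchall()

print(len(wide_rows), len(wide_rows[0]))
print(narrow_rows)
```

Every value fetched has to be converted into a Python object, so trimming the result set on the server side reduces the work the client does per query.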

Custom Model Evaluation Metrics with MLflow

Published 2022-04-22 by Kevin Feasel

Mark Zhang shows off a new bit of functionality in MLflow:

According to an internal customer survey, 75% of respondents say they frequently or always use specialized, business-focused metrics in addition to basic ones like accuracy and loss. Data scientists often utilize these custom metrics as they are more descriptive of business objectives (e.g. conversion rate), and contain additional heuristics not captured by the model prediction itself.
In this blog, we introduce an easy and convenient way of evaluating MLflow models on user-defined custom metrics. With this functionality, a data scientist can easily incorporate this logic at the model evaluation stage and quickly determine the best-performing model without further downstream analysis.

Click through to see how to use built-in metrics but also how to create your own.

Iteratively Tuning Graph Neural Networks

Published 2022-04-20 by Kevin Feasel

Luis Bermudez takes us through the process of tuning one flavor of neural network:

We made our own implementations of OGB leaderboard entries for two popular GNN frameworks: GraphSAGE and a Relational Graph Convolutional Network (RGCN). We then designed and executed an iterative experimentation approach for hyperparameter tuning where we seek a quality model that takes minimal time to train. We define quality by running an unconstrained performance tuning loop, and use the results to set thresholds in a constrained tuning loop that optimizes for training efficiency.

Read on to see how they did it.
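The two-loop idea in that excerpt can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: train_and_evaluate is a toy stand-in with made-up formulas rather than a real GNN training run, the search space and tolerance are arbitrary, and for brevity the second step filters the results of the first loop instead of launching a fresh round of experiments as the authors describe.

```python
from itertools import product


def train_and_evaluate(hidden_size, num_layers):
    """Toy stand-in for a training run; returns (accuracy, training_seconds)."""
    accuracy = min(0.95, 0.60 + 0.05 * num_layers + 0.0005 * hidden_size)
    seconds = 2.0 * num_layers * hidden_size
    return accuracy, seconds


# Hypothetical search space: (hidden_size, num_layers) pairs.
search_space = list(product([64, 128, 256], [1, 2, 3]))

# Loop 1 (unconstrained): find the best accuracy regardless of training time.
results = {cfg: train_and_evaluate(*cfg) for cfg in search_space}
best_accuracy = max(acc for acc, _ in results.values())

# Turn that result into a quality threshold, with an arbitrary tolerance.
threshold = best_accuracy - 0.07

# Loop 2 (constrained): among configurations that meet the threshold,
# pick the one that trains fastest.
eligible = {cfg: res for cfg, res in results.items() if res[0] >= threshold}
fastest_cfg = min(eligible, key=lambda cfg: eligible[cfg][1])

print(fastest_cfg, eligible[fastest_cfg])
```

With these toy numbers the largest model scores best, but a smaller configuration clears the threshold in about half the training time, which is the trade-off the constrained loop is meant to surface.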
