Python – Page 24 – Curated SQL

Python does not have a character data type. Strings are sequences of Unicode characters, and a single character is simply a string of length one. Sometimes we need to split a string on a defined separator, much like the Text to Columns feature in Microsoft Excel.

Click through for a number of examples.
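
As a minimal sketch of the idea in the excerpt (the sample record and variable names here are made up for illustration, not taken from the linked article), the built-in `str.split` and `str.join` methods cover the basic cases:

```python
# A comma-separated record, similar to one row you might split in Excel.
# The values are invented sample data.
record = "Jaipur,Rajasthan,India"

# str.split breaks the string on the separator and returns a list of strings.
city, state, country = record.split(",")

# maxsplit limits the number of splits; the remainder stays in one piece.
head, rest = record.split(",", 1)

# str.join goes the other way: it concatenates strings with a separator.
rejoined = " | ".join([city, state, country])

# Python has no char type: indexing a string yields another str of length 1.
first = record[0]

print(city, state, country)
print(head, "/", rest)
print(rejoined)
print(type(first).__name__, len(first))
```

Tuple unpacking works here only because the number of fields is known in advance; for ragged input, keeping the list that `split` returns is the safer choice.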

Comments closed

Avoiding Loops in Python with NumPy

Published 2020-04-17 by Kevin Feasel

Swantika Gupta walks us through vectorization and broadcasting with NumPy:

Vectorization is a powerful capability in NumPy that speeds up code execution without explicit loops. It expresses operations as occurring on entire arrays rather than on their individual elements.
Looping over an array or any other data structure in Python involves a lot of overhead. In NumPy, vectorized operations delegate the looping to highly optimized C and Fortran functions, making for cleaner and faster Python code. Vectorization, then, means replacing explicit for-loops with array expressions that are computed internally in a low-level language like C.

Read on for a few examples of this and broadcasting.
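
To make the excerpt concrete, here is a small sketch (the arrays and variable names are my own, not from the linked post) that computes the same result with an explicit loop and with a vectorized expression, followed by a simple case of broadcasting:

```python
import numpy as np

values = np.arange(5, dtype=np.float64)  # [0., 1., 2., 3., 4.]

# Explicit Python loop: one interpreted iteration per element.
looped = np.empty_like(values)
for i in range(values.size):
    looped[i] = values[i] * 2.0 + 1.0

# Vectorized form: the same arithmetic written as a whole-array expression.
vectorized = values * 2.0 + 1.0

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) grid
# without either array being copied out to the full shape by hand.
column = np.array([[0.0], [10.0], [20.0]])
row = np.array([1.0, 2.0, 3.0, 4.0])
grid = column + row

print(np.array_equal(looped, vectorized))
print(grid.shape)
```

On arrays this small the speed difference is negligible; the gap tends to show up as the element count grows.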

Time Series Forecasting Best Practices

Published 2020-04-15 by Kevin Feasel

David Smith talks about a new GitHub repo:

The repository includes detailed examples of various time series modeling techniques, as Jupyter Notebooks for Python, and R Markdown documents for R. It also includes Python notebooks to fit time series models in the Azure Machine Learning service, and then operationalize the forecasts as a web service.
The R examples demonstrate several techniques for forecasting time series, specifically data on refrigerated orange juice sales from 83 stores (sourced from the bayesm package). The forecasting techniques vary (mean forecasting with interpolation, ARIMA, exponential smoothing, and additive models), but all make extensive use of the tidyverts suite of packages, which provides “tidy time series forecasting for R”. The forecasting methods themselves are explained in detail in the book (readable online) Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (Monash University).

This looks really cool.

Distributed XGBoost in Cloudera

Published 2020-04-13 by Kevin Feasel

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

Dask is an open-source parallel computing framework, written natively in Python, that integrates well with popular Python packages such as NumPy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.
Dask uses Python natively, distinguishing it from Spark, which is written in Scala, runs on the JVM, and has the overhead of running JVMs and context switching between Python and the JVM. It is also much harder to debug Spark errors than to read a Python stack trace that comes from Dask.
We will run XGBoost on Dask to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.

Tips for Moving from Pandas to Koalas

Published 2020-04-03 by Kevin Feasel

Haejoon Lee, et al., walk us through migrating existing code written for Pandas to use the Koalas library:

In particular, two types of users benefit the most from Koalas:
– pandas users who want to scale out using PySpark and potentially migrate their codebase to PySpark. Koalas is scalable and makes learning PySpark much easier.
– Spark users who want to leverage Koalas to become more productive. Koalas offers pandas-like functions so that users don’t have to build these functions themselves in PySpark.
This blog post will not only demonstrate how easy it is to convert code written in pandas to Koalas, but also discuss best practices for using Koalas: when to use Koalas as a drop-in replacement for pandas, how to use PySpark as a workaround when pandas APIs are not available in Koalas, and when to apply Koalas-specific APIs to improve productivity. The example notebook in this blog can be found here.

Read on to learn more.

Fun with Python: Calculating Pi

Published 2020-03-24 by Kevin Feasel

Jon Fletcher implements a method of estimating the value of Pi:

This series converges to Pi: the more terms that are added to the series, the closer the value is to Pi.
For a proof of why this series converges to Pi, see https://proofwiki.org/wiki/Leibniz’s_Formula_for_Pi
There are several points to note about the series:
– It’s infinite, so we need to find a way to keep adding term after term.
– The denominator of the fraction increases by 2 every term.
– The terms alternate between positive and negative.

Click through for the implementation of the formula in Python, and for what you should do if you really need to reference Pi in your Python code.
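
For a rough idea of what such an implementation can look like, here is a sketch based on the Leibniz formula named in the proof link, pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... This is not the linked author's code: the function name `leibniz_pi` and the fixed term count are my own choices, and the infinite series is simply truncated after a set number of terms.

```python
import math


def leibniz_pi(terms: int) -> float:
    """Estimate pi from the first `terms` terms of the Leibniz series.

    Hypothetical helper for illustration; the series is truncated, so the
    result is only an approximation.
    """
    total = 0.0
    sign = 1.0
    for k in range(terms):
        total += sign / (2 * k + 1)  # denominators 1, 3, 5, ... (step of 2)
        sign = -sign                 # terms alternate between + and -
    return 4.0 * total               # the series sums to pi / 4


estimate = leibniz_pi(100_000)
print(estimate)
print(abs(estimate - math.pi))  # gap to the standard library constant
```

The series converges slowly, which is why the standard library's `math.pi` is the practical choice when you only need the value.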

Distributed Model Training with Dask and SciKit-Learn

Published 2020-03-18 by Kevin Feasel

Matthieu Lamairesse shows us how we can use Dask to perform distributed ML model training:

Dask is an open-source parallel computing framework written natively in Python (initially released in 2014). It has a significant following and support, largely due to its good integration with the popular Python ML ecosystem triumvirate of NumPy, Pandas and Scikit-learn.
Why Dask over other distributed machine learning frameworks?
In the context of this article, it’s about Dask’s tight integration with Scikit-learn’s Joblib parallel computing library, which allows us to distribute Scikit-learn code with (almost) no code changes, making it a very interesting framework for accelerating ML training.

Click through for an interesting article and an example of using this on Cloudera’s ML platform.

Working with Azure ML Notebooks

Published 2020-03-16 by Kevin Feasel

Leila Etaati takes us through notebooks in Azure Machine Learning:

The new Azure ML environment contains an Azure Notebook in which you can write Python code. In this post, I will go through the experiment and show how we can use this environment for regression analysis.

Click through for the screenshot-laden demo.

Using Python to Pivot Data in SQL Server

Published 2020-03-16 by Kevin Feasel

Rajendra Gupta shows a few ways to pivot data using Python:

We can also use groupby and lambda functions in Python scripts to build pivot tables. For this example, I have a data set of a few states of India and their cities in a SQL table.
We need a pivot table from this data. The output should list all cities for each state in a single column, using || as the city name separator.

This is an unorthodox but interesting use of Machine Learning Services.
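
Outside of SQL Server, the groupby-and-lambda step in the excerpt can be sketched with pandas alone. This assumes pandas is the library the script relies on. The sample rows and the `State` and `City` column names are made up for illustration, and the Machine Learning Services wrapper around the script is left out.

```python
import pandas as pd

# Made-up sample rows standing in for the SQL table of states and cities.
# Column names are assumptions, not taken from the linked article.
df = pd.DataFrame(
    {
        "State": ["Rajasthan", "Rajasthan", "Gujarat", "Gujarat", "Punjab"],
        "City": ["Jaipur", "Udaipur", "Surat", "Ahmedabad", "Amritsar"],
    }
)

# Group by state, then join each group's city names with "||".
pivoted = (
    df.groupby("State")["City"]
    .apply(lambda cities: "||".join(cities))
    .reset_index()
)

print(pivoted)
```

By default `groupby` sorts the group keys, so states come out in alphabetical order while cities keep the order they had within each state.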

Python Cross-Validation

Published 2020-03-12 by Kevin Feasel

John Mount has some advice if you’re doing cross-validation in Python:

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross validation settings. The defaults can produce a deterministic, and even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator, they are repeatable, deterministic, and side-effect free.
This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

Click through to see the problem and how you can fix it.
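
To see why an ordered split can hurt, here is a pure-Python sketch of the idea. It is not the code from the linked post and not any library's implementation: `kfold_indices` is a hypothetical helper and the labels are made up. In scikit-learn, for instance, `KFold` takes a `shuffle` argument that defaults to `False`.

```python
import random


def kfold_indices(n_rows, n_folds, shuffle=False, seed=None):
    """Hypothetical helper: return one list of test indices per fold."""
    indices = list(range(n_rows))
    if shuffle:
        # A private Random instance leaves the global generator untouched
        # and makes the split repeatable for a given seed.
        random.Random(seed).shuffle(indices)
    # Cut the (possibly shuffled) indices into contiguous blocks.
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


# Made-up labels sorted by class, as data often is after an ORDER BY.
labels = [0] * 6 + [1] * 6

ordered = kfold_indices(len(labels), 3)
shuffled = kfold_indices(len(labels), 3, shuffle=True, seed=2020)

for name, folds in (("ordered", ordered), ("shuffled", shuffled)):
    print(name, [[labels[i] for i in fold] for fold in folds])
```

With the ordered split, the first test fold holds only class 0 and the last only class 1, so a model is scored on rows that look nothing like a random sample. Passing an explicit seed keeps the shuffled version repeatable.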
