Category: Python

The Basics of Apache Airflow

Published 2019-08-26 by Kevin Feasel

Divyansh Jain explains what Apache Airflow is and takes us through a sample solution:

Airflow is a platform to programmatically author, schedule & monitor workflows or data pipelines. These functions achieved with Directed Acyclic Graphs (DAG) of the tasks. It is an open-source and still in the incubator stage. It was initialized in 2014 under the umbrella of Airbnb since then it got an excellent reputation with approximately 800 contributors on GitHub and 13000 stars. The main functions of Apache Airflow is to schedule workflow, monitor and author.

It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.

Comments closed

Python versus R (Again)

Published 2019-08-16 by Kevin Feasel

Alex Woodie looks at whether Python is dominating R in the data science space:

There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged by more than 3% to achieve a 10% rating (TIOBE’s proprietary metric that primarily measures search activity), easily the biggest gain among the 20 most popular languages.
R, by contrast, has not fared well lately on the TIOBE Index, where it dropped from 8th place in January 2018 to become the 20th most popular language today, behind Perl, Swift, and Go. At its peak in January 2018, R had a popularity rating of about 2.6%. But today it’s down to 0.8%, according to the TIOBE index.

I’ll say that rumors of R’s demise are premature.

Comments closed

Microsoft ML Server 9.4

Published 2019-07-31 by Kevin Feasel

Jeroen Ter Heerdt announces Microsoft Machine Learning Server 9.4:

Today we’re excited to announce our latest Microsoft Machine Learning Server 9.4 release, which addresses popular customer requests as well as developments in the R and Python community.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features and algorithmic innovation that brings the best of open source and proprietary worlds together.

This is the best way to bind new versions of R and Python to your SQL Server ML Services installation.

Comments closed

Building an Image Classifier with PyTorch

Published 2019-07-19 by Kevin Feasel

Rogier van der Geer shows how you can use PyTorch to build out a Convolutional Neural Network for image classification:

The tool that we are going to use to make a classifier is called a convolutional neural network, or CNN. You can find a great explanation of what these are right here on wikipedia.
But we are not going to fully train one ourselves: that would take way more time than I would be willing to spend. Instead, we are going to do transfer learning, where we take a pre-trained CNN and replace only the last layer by a layer of our own. Then we only need to train that single layer, as all the other layers already have weights that are quite sensible. Here we exploit the fact that the images we are interested in have a lot of the same properties as those images that the original network was trained on. You can find a great explanation of transfer learning here.

Read on for a detailed example.

Comments closed

A Quick Keras Example

Published 2019-07-15 by Kevin Feasel

Shubham Dangare takes us through a quick example using Keras and TensorFlow in Python:

Keras is a high-level neural networks API, written in Python and capable of running on top of Tensorflow, CNTK or Theano. It was developed with a focus on enabling fast experimentation. In this blog, we are going to cover one small case study for fashion mnist.
Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

The end result wasn’t that great, but Shubham was using a sequential model rather than a convolutional neural network, so you can probably take this as a starting point and improve upon it.

Comments closed

ML Services and Injectable Code

Published 2019-07-09 by Kevin Feasel

Grant Fritchey looks at sp_execute_external_script for potential SQL injection vulnerabilities:

The sharp eyed will see that the data set is defined by SQL. So, does that suffer from injection attacks? Short answer is no. If there was more than one result set within the Python code, it’s going to error out. So you’re protected there.
This is important, because the data set query can be defined with parameters. You can pass values to those parameters, heck, you’re likely to pass values to those parameters, from the external query or procedure. So, is that an attack vector?
No.

Another factor is that you need explicitly to grant EXECUTE ANY EXTERNAL SCRIPT rights to non-sysadmin, non-db_owner users, meaning a non-privileged user can’t execute external scripts at all. You can also limit the executing service account

Comments closed

Pandas Multiindex and T-SQL

Published 2019-06-28 by Kevin Feasel

Tomaz Kastrun explains why you should never cross the streams:

1. SQL Server and Python Pandas Indexes are two different worlds and should not be mixed.
2. SQL Server uses Index primarily for DML operations and to keep data ACID.
3. Python Pandas uses Index and MultiIndex for keeping data dimensionality when performing data wrangling and statistical analysis.
4. SQL Server Index and Python Pandas Index don’t know about each other’s existence, meaning if user want to propagate the T-SQL index to Python Pandas (in order to minimize the impact of duplicates, missing values or to impose the relational model), it needs to be introduced and created, once data enters “in the python world”.

Read on for additional conclusions and the demos which bring us here.

Comments closed

Lasso and Ridge Regression in Python

Published 2019-06-26 by Kevin Feasel

Kristian Larsen shows off a few regression techniques using Python:

Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Therefore, when you conduct a regression model it can be helpful to do a lasso regression in order to predict how many variables your model should contain. This secures that your model is not overly complex and prevents the model from over-fitting which can result in a biased and inefficient model.

Read on for demonstrations.

Comments closed

Sales Predictions with Pandas

Published 2019-06-19 by Kevin Feasel

Megan Quinn shows how you can use Pandas and linear regression to predict sales figures:

Pandas is an open-source Python package that provides users with high-performing and flexible data structures. These structures are designed to make analyzing relational or labeled data both easy and intuitive. Pandas is one of the most popular and quintessential tools leveraged by data scientists when developing a machine learning model. The most crucial step in the machine learning process is not simply fitting a model to a given data set. Most of the model development process takes place in the pre-processing and data exploration phase. An accurate model requires good predictors and, in order to acquire them, the user must understand the raw data. Through Pandas’ numerous data wrangling and analysis tools, this important step can easily be achieved. The goal of this blog is to highlight some of the central and most commonly used tools in Pandas while illustrating their significance in model development. The data set used for this demo consists of a supermarket chain’s sales across multiple stores in a variety of cities. The sales data is broken down by items within the stores. The goal is to predict a certain item’s sale.

Click through for an example of the process, including data cleansing and feature extraction, data analysis, and modeling.

Comments closed

Monte Carlo Simulation in Python

Published 2019-06-14 by Kevin Feasel

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions:

One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how to do a Monte Carlo simulation in Python. Furthermore, you learn how to make different Statistical probability distributions in Python.

You can also bootstrap your data, reusing data points when building a set of samples:

A useful method for data scientists/data analysts in order to validate methods or data is Bootstrap with Monte Carlo simulation In this article, you learn how to do a Bootstrap with Monte Carlo simulation in Python.

Both posts are worth the read.

Comments closed