Press "Enter" to skip to content

Category: Python

Sales Predictions with Pandas

Megan Quinn shows how you can use Pandas and linear regression to predict sales figures:

Pandas is an open-source Python package that provides users with high-performing and flexible data structures. These structures are designed to make analyzing relational or labeled data both easy and intuitive. Pandas is one of the most popular and quintessential tools leveraged by data scientists when developing a machine learning model. The most crucial step in the machine learning process is not simply fitting a model to a given data set. Most of the model development process takes place in the pre-processing and data exploration phase. An accurate model requires good predictors and, in order to acquire them, the user must understand the raw data. Through Pandas’ numerous data wrangling and analysis tools, this important step can easily be achieved. The goal of this blog is to highlight some of the central and most commonly used tools in Pandas while illustrating their significance in model development. The data set used for this demo consists of a supermarket chain’s sales across multiple stores in a variety of cities. The sales data is broken down by items within the stores. The goal is to predict a certain item’s sale.

Click through for an example of the process, including data cleansing and feature extraction, data analysis, and modeling.

Comments closed

Monte Carlo Simulation in Python

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions:

One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how to do a Monte Carlo simulation in Python. Furthermore, you learn how to make different Statistical probability distributions in Python.

You can also bootstrap your data, reusing data points when building a set of samples:

A useful method for data scientists/data analysts in order to validate methods or data is Bootstrap with Monte Carlo simulation In this article, you learn how to do a Bootstrap with Monte Carlo simulation in Python.

Both posts are worth the read.

Comments closed

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow:

Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.

Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.

With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric.  For example, calling CrossValidator.fit() will log one parent run.  Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric.  Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.

Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.

Comments closed

Linear Regression With Python In Power BI

Emanuele Meazzo builds a linear regression in Power BI using a Python visual:

As a prerequisite, of course, you’ll need to have python installed in your machine, I recommend having an external IDE like Visual Studio Code to write your Python code as the PowerBI window offers zero assistance to coding.

You can follow this article in order to configure Python Correctly for PowerBI.

Step 2 is to add a Python Visual to the page, and let the magic happen.

Click through for the step-by-step instructions, including quite a bit of Python code and a few warnings and limitations.

Comments closed

Learning R Versus Python

Andy Kirk shares the results of a rather informal Twitter poll:

Yesterday I ran a simple Twitter poll about the relative ease of learning R vs. Python. Although a correct answer to this query will ALWAYS have to be based on nuances like pre-existing skills and the scope of need, this originates from people telling me they encounter job or career profiles that list a need for R and/or Python. If they don’t have either, if they prioritised the pursuit of just one, which would be possible to develop a degree of competency more easily, more quickly and more efficiently?

Andy has also created a Twitter moment from the responses.

My thought, based only on the question itself, is that R would be better than Python because the hypothetical person has no additional programming skills. For someone with additional programming skills, the breakdown for me starts with, if your background is statistics, database development, or functional programming, you probably want R; if your background is object-oriented development or imperative programming, you probably want Python. And then it gets nuanced.

Comments closed

Linear Programming in Python

Francisco Alvarez shows us an example of linear programming in Python:

The first two constraints, x1 ≥ 0 and x2 ≥ 0 are called nonnegativity constraints. The other constraints are then called the main constraints. The function to be maximized (or minimized) is called the objective function. Here, the objective function is x1 + x2.

Two classes of problems, called here the standard maximum problem and the standard minimum problem, play a special role. In these problems, all variables are constrained to be nonnegative, and all main constraints are inequalities.

That post spurred me on to look up LINGO and see that it’s actually still around.

Comments closed

From pandas to Spark with koalas

Achilleus tries out Koalas:

Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium datasets but not so good on big-data workloads. The challenge now becomes to convert the existing pandas code to pyspark code. This is just not straight forward and has a lot of performance hits if python UDFs are used without much care.

Koalas tries to address the first problem ie lessen the friction of learning different APIs to port their existing Pandas code to Pyspark. With Koalas, we can just directly replace the existing pandas code with Koalas. As far as the performance goes, there are no numbers yet as it is still in the initial phase of the project. But this definitely looks promising though.

Read on for some initial thoughts on the product, including a few gotchas.

Comments closed

Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Comments closed

Sentiment Analysis with Python

Bruno Stecanella shows us how to use MonkeyLearn to perform sentiment analysis in Python:

Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text (in more academic circles, a document) written in natural language and extracts the opinions present in the text.

In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g. positiveneutral, and negative.

For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely‘ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting‘ labeled as Negative.

Click through for a demo.

Comments closed

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.

Comments closed