
Category: Python

Python Cross-Validation

John Mount has some advice if you’re doing cross-validation in Python:

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross-validation settings. The default can be a deterministic, even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator, they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

Click through to see the problem and how you can fix it.
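To make the tip concrete, here is a minimal sketch (assuming scikit-learn, which is almost certainly the library in question): the default splitter doesn’t shuffle, while an explicit KFold with shuffle=True and a fixed seed is randomized but still reproducible.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = Ridge()

# Default: cv=5 means an unshuffled KFold, so the folds are deterministic
# and simply follow the row order of the data.
default_scores = cross_val_score(model, X, y, cv=5)

# Explicitly shuffled folds with a fixed seed: randomized from a statistical
# point of view, but still reproducible.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled_cv)

print(default_scores.mean(), shuffled_scores.mean())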


Working with Multiple DataFrames in Pandas

Zehra Can shows how to concatenate and work with multiple data frames at once in Pandas:

Data can be selected from data frames by using loc and iloc options:

loc is used for selecting rows and columns by index and value label; columns can be selected by column names.

iloc is used for selecting rows and columns by their integer positions.

This is a demo-heavy, tutorial-style post, so it has plenty of code to look at.
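For a quick feel for the two selectors and for stacking frames, here is a toy sketch of my own (not code from the post):

import pandas as pd

q1 = pd.DataFrame({"region": ["east", "west"], "sales": [100, 150]})
q2 = pd.DataFrame({"region": ["east", "west"], "sales": [120, 90]})

# Stack multiple data frames into one, renumbering the index.
sales = pd.concat([q1, q2], ignore_index=True)

# loc selects by labels: row labels (here a boolean mask) and column names.
east_sales = sales.loc[sales["region"] == "east", "sales"]

# iloc selects by integer position: first two rows, second column.
first_two = sales.iloc[0:2, 1]

print(east_sales.sum(), first_two.tolist())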


Machine Learning through Counterfactuals

Amit Sharma announces a new library:

Consider a person who applies for a loan with a financial company, but their application is rejected by a machine learning algorithm used to determine who receives a loan from the company. How would you explain the decision made by the algorithm to this person? One option is to provide them with a list of features that contributed to the algorithm’s decision, such as income and credit score. Many of the current explanation methods provide this information by either analyzing the algorithm’s properties or approximating it with a simpler, interpretable model.

However, these explanations do not help this person decide what to do next to increase their chances of getting the loan in the future. In particular, changing the most important features for prediction may not actually change the decision, and in some cases, important features may be impossible to change, such as age. A similar argument applies when algorithms are used to support decision-makers in scenarios such as screening job applicants, deciding health insurance, or disbursing government aid.

This has the potential to be a great library. One of the issues with machine learning as it stands today is that you can get an answer, but understanding how to change that answer requires a human who understands the model. This looks like a good first step. It’s only available in Python.
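The excerpt doesn’t name the library or its API, so purely as an illustration of the counterfactual idea itself (and not of the library), here is a brute-force sketch that searches for the smallest single-feature change which flips a toy loan model’s decision:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy loan data: the two features are income (in $1000s) and credit score,
# and the label is whether the loan was approved.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(60, 15, 500), rng.normal(650, 60, 500)])
y = ((X[:, 0] > 55) & (X[:, 1] > 640)).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

applicant = np.array([[50.0, 620.0]])  # a rejected application
print("current decision:", model.predict(applicant)[0])

# For each feature the applicant could plausibly change, find the smallest
# increase that flips the model's decision to "approved" -- a very crude
# counterfactual explanation.
for name, feature, step in [("income", 0, 1.0), ("credit score", 1, 5.0)]:
    for k in range(1, 201):
        candidate = applicant.copy()
        candidate[0, feature] += k * step
        if model.predict(candidate)[0] == 1:
            print(f"raise {name} by {k * step:g} -> approved")
            break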


Choosing Categorical Features with Python

Mesfin Gebeyaw shows how to use Multiple Correspondence Analysis to filter categorical variables for an analysis:

A general guide to interpreting the multiple correspondence analysis plot shown above for business insights would be to note how close input categorical features are to the target variable customer churn and to each other. For instance, senior citizens, customers with fiber optic internet service, those with month-to-month contractual agreements, and single customers or customers with no dependents are associated with a short tenure with the company and a high propensity to churn. On the other hand, customers with contracts longer than a year, those with DSL internet service, younger customers, and customers with multiple lines are associated with a long tenure with the company and a higher tendency to stay with the company.
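As a rough sketch of how you might run an MCA in Python, here is a toy example assuming the prince package (the post itself may use a different tool); the column coordinates are what you would inspect for the kind of proximity argument quoted above:

import pandas as pd
import prince  # assumed MCA implementation; the post may use another tool

# Toy churn-style data with only categorical features.
df = pd.DataFrame({
    "contract": ["month-to-month", "two-year", "month-to-month", "one-year"] * 25,
    "internet": ["fiber", "dsl", "fiber", "dsl"] * 25,
    "senior":   ["yes", "no", "no", "no"] * 25,
    "churn":    ["yes", "no", "yes", "no"] * 25,
})

mca = prince.MCA(n_components=2, random_state=42).fit(df)

# Coordinates of each category level in the MCA plane; levels that sit close
# to the churn "yes" level are the ones you would flag as churn-related.
print(mca.column_coordinates(df))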

Read the whole thing.


Data Visualization in R and Python

Michelle Golchert contrasts libraries for visualizing data in R and Python:

Unlike R, Python – as a “general-purpose” programming language – does not include data visualization tools by default. However, Python also provides many libraries for this purpose, such as Matplotlib and Seaborn.

Python now also offers numerous packages (like plotnine and ggpy) which are equivalents of ggplot2 in R, and allow you to create plots in Python according to the same “Grammar of Graphics” principle.

This is an area where I think R has the upper hand at most levels: it’s easier to get started plotting with R (thanks to the built-in plots), it’s easier to do “intermediate-quality” plots (stuff you would use in an internal presentation), and you tend to have more control when building professional-quality plots. You can certainly create beautiful visuals in both languages, though.
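For a sense of how closely the Python ports follow ggplot2’s grammar, here is a tiny plotnine sketch of my own:

import pandas as pd
from plotnine import ggplot, aes, geom_point, labs

df = pd.DataFrame({
    "engine_size": [1.6, 2.0, 2.5, 3.0, 3.5, 4.0],
    "mpg":         [38, 33, 30, 26, 23, 20],
    "type":        ["compact", "compact", "sedan", "sedan", "suv", "suv"],
})

# The same grammar-of-graphics layering you would write in ggplot2.
plot = (
    ggplot(df, aes(x="engine_size", y="mpg", color="type"))
    + geom_point(size=3)
    + labs(x="Engine size (L)", y="Miles per gallon")
)
plot.save("mpg.png")  # or just evaluate `plot` in a notebook to display it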


Schiphol Takeoff: Low-Code Automated Deployment

Tim van Cann and Daniel van der Ende have an open source project for automatic deployment on Azure:

To give a bit more insight into why we built Schiphol Takeoff, it’s good to take a look at an example use case. This use case ties a number of components together:

– Data arrives in a (near) real-time stream on an Azure Eventhub.
– A Spark job running on Databricks consumes this data from Eventhub, processes the data, and outputs predictions.
– A REST API is running on Azure Kubernetes Service, which exposes the predictions made by the Spark job.

Conceptually, this is not a very complex setup. However, there are quite a few components involved:

– Azure Eventhub
– Azure Databricks
– Azure Kubernetes Service

Each of these individually has some form of automation, but there is no unified way of coordinating and orchestrating deployment of the code to all at the same time. If, for example, you were to change the name of the consumer group for Azure Eventhub, you could script that. However, you’d also need to manually update your Spark job running on Databricks to ensure it could still consume the data.

This looks pretty nice. I’ll need to dive into it some more.


Azure Databricks and Delta Lake

Brad Llewellyn starts a new series on Delta Lake in Azure Databricks:

Saving the data in Delta format is as simple as replacing the .format("parquet") function with .format("delta"). However, we see a major difference when we look at the table creation. When creating a table using Delta, we don’t have to specify the schema, because the schema is already strongly defined when we save the data. We also see that Delta tables can be easily queried using the same SQL we’re used to. Next, let’s compare what the raw files look like by examining the blob storage container that we are storing them in.

There are some good demos in this post and it promises to be a nice series.
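As a sketch of the pattern Brad describes, assuming a Databricks-style environment where Delta support and a Spark session are available (names and paths below are illustrative):

from pyspark.sql import SparkSession

# On Databricks the `spark` session already exists and Delta support is built in;
# the path and table name here are illustrative.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

path = "/mnt/demo/customers_delta"

# The same write as with Parquet, with only the format string changed.
df.write.format("delta").mode("overwrite").save(path)

# Delta stores the schema with the data, so the table definition doesn't repeat it.
spark.sql(f"CREATE TABLE IF NOT EXISTS customers USING DELTA LOCATION '{path}'")

# And the table can be queried with the SQL we're used to.
spark.sql("SELECT name FROM customers ORDER BY id").show()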


Using SQL Server as a REST API Back-End

Davide Mauri shows how you can use SQL Server to power an API, using Flask as an example:

I mentioned in my previous article that having native JSON support in Azure SQL is a game changer, as it profoundly changes the way a developer can interact with a relational database, bringing the simplicity and the flexibility needed in today’s modern applications.

As Python is becoming immensely popular, one of the most common tasks for a developer is to create a REST API using Python. Thanks to JSON support, using Azure SQL as a backend database for your API is as easy as writing to a text file, with the difference that behind the scenes your data will be safely stored and made available on request, at scale. You also have the option to push as much compute to the data as you want, so that you can leverage the powerful query and processing engine while keeping your code simple, elegant, and agile, with a clear separation of concerns. All these things will help you immensely once you start to evolve your project to keep up with today’s demanding and ever-changing world.

Those who remember the days of ASMX web services in SQL Server (thankfully removed after 2005) might cringe, but I’ve actually done something like this for a company, where all of the data lived in SQL Server and the transformation logic was pretty simple. If you have to monkey with the JSON afterward in your middle tier, then just bring back a data set, but in a scenario like Davide shows, moving the JSON creation to Python wouldn’t really gain you anything.
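As a minimal sketch of that pattern, assuming pyodbc and Flask, with table and column names of my own invention; the JSON is shaped by the database via FOR JSON and Python just passes it through:

import pyodbc
from flask import Flask, Response

app = Flask(__name__)

# Connection string and object names are illustrative, not from the post.
CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net;Database=mydb;"
    "Uid=api_user;Pwd=<password>;Encrypt=yes;"
)

@app.route("/customers/<int:customer_id>")
def get_customer(customer_id):
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        # The database shapes the JSON itself; Python just passes it through.
        cursor.execute(
            "SELECT CustomerId, Name, City FROM dbo.Customers "
            "WHERE CustomerId = ? "
            "FOR JSON PATH, WITHOUT_ARRAY_WRAPPER",
            customer_id,
        )
        row = cursor.fetchone()
    body = row[0] if row else "{}"
    return Response(body, mimetype="application/json")

if __name__ == "__main__":
    app.run()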


Querying SQL Server from Python

Hasan Savran builds an Azure Data Studio notebook to query SQL Server from Python:

SQL Kernel is the default language; to query the database with Python, change SQL to Python 3. You will probably see the following message if this is the first time you are trying this. You need to install Python packages to be able to run Python scripts. I have Visual Studio installed on my machine and I already have Python, so I thought I could use it by clicking “Use existing Python installation”. I was wrong; I couldn’t. This option looks for local installation files, and when I pointed to the Visual Studio Python files, it threw an error in the middle of the installation. So, I will ignore this option for now.

In ADS, I haven’t gotten “Use existing Python location” to work either, so Hasan’s not alone in that regard.


Re-Introducing rquery

John Mount has a new introduction to rquery:

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

Check it out.
