Press "Enter" to skip to content

Category: Data Science

Using Convolutional Neural Networks To Recognize Features In Images

Michael Grogan shows how you can use Keras to perform image recognition with a convolutional neural network:

VGG16 is a built-in neural network in Keras that is pre-trained for image recognition.

Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with 10,000 or greater being preferable.

In this regard, it is much easier to use a pre-trained neural network that has already been designed for image classification purposes.

This is probably the best generally available technique for image classification.

Comments closed

Combining Stream Analytics And Azure ML With Power BI

Brad Llewellyn shows us how to feed Azure ML predictions into Power BI via Azure Stream Analytics:

Today, we’re going to talk about combining Stream Analytics with Azure Machine Learning Studio within Power BI.  If you haven’t read the earlier posts in this series, IntroductionGetting Started with R ScriptsClusteringTime Series DecompositionForecastingCorrelationsCustom R VisualsR Scripts in Query EditorPythonAzure Machine Learning Studio and Stream Analytics, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let’s move on to the core of this post, Stream Analytics.

This post is going to build directly on what we created in the two previous posts, Azure Machine Learning Studio and Stream Analytics.  As such, we recommend that you read them before proceeding.

Read on for the demo.

Comments closed

Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedence mismatch between data engineers, data scientists, and ML production engineers:

Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. In addition to Python support, there is typically support for other programming languages, including JavaScript for web integration and Java for platform integration—though oftentimes with fewer features and less maturity. No matter what other platforms are supported, chances are very high that your data scientists will build and train their analytic models with Python.

There is an impedance mismatch between model development using Python, its tool stack and a scalable, reliable data platform with low latency, high throughput, zero data loss and 24/7 availability requirements needed for data ingestion, preprocessing, model deployment and monitoring at scale. Python in practice is not the most well-known technology for these requirements. However, it is a great client for a data platform like Apache Kafka.

Click through for the full argument as well as where Kafka can help mitigate some of the issues.

Comments closed

Solving The Monty Hall Problem With R

Miroslav Rajter builds a Monty Hall problem simulator using R:

The original and most simple scenario of the Monty Hall problem is this: You are in a prize contest and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (Car), while behind others is a loss (Goat). You first choose a door (let’s say door A). The contest host then opens another door behind which is a goat (let’s say door B), and then he ask you will you stay behind your original choice or will you switch the door. The question behind this is what is the better strategy?

This is something that puzzled me for a very long time. This is fundamentally a Bayesian problem built around processing new information, and once I understood that, the answer was a lot clearer. H/T R-Bloggers.

Comments closed

codecentric.ai Bootcamp

Shirin Glander announces a free German-language bootcamp:

This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision and NLP. However, the course language is German only, but for every chapter I did, you will find an English R-version here on my blog (see below for links).

Right now, the course is in beta phase, so we are happy about everyone who tests our content and leaves feedback. Also, not the entire curriculum is finished yet, we will update and extend the course during the next months. If there are specific topics you’d like to have us cover, just let us know!

If you understand German and want to learn about data science, check this out and leave feedback.

Comments closed

Misinterpretation and Misuse of P-Values and Confidence Intervals

Dave Giles has some good details on common problems of misinterpretation:

There are so many things in statistics (and hence in econometrics) that are easily, and frequently, misinterpreted. Two really obvious examples are p-values and confidence intervals.

I’ve devoted some space in earlier posts to each of these concepts, and their mis-use. For instance, in the case of p-values, see the posts here and here; and for confidence intervals, see here and here.

Click through for more in this vein, including a reference to an interesting-looking paper.

Comments closed

Gartner Advanced Analytics Magic Quadrant Updates

William Vorhies summarizes the changes to the Gartner Advanced Analytics magic quadrant:

The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out and once again there are big changes in the leaderboard.  Say what you will about our profession but as a platform developer you certainly can’t rest on your laurels.  Some traditional leaders have fallen (SAS, KNIME, H2Oai, IBM) and some challengers have risen (Alteryx, TIBCO, RapidMiner).

Databricks is making a big push and there’s more movement than usual in this year’s chart. Check it out.

Comments closed

Market Basket Analysis With arulesSequences

Allison Koenecke takes us through the arulesSequences package in R:

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time.  We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix.  We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month.  Actual Azure Service names have been removed for confidentiality reasons.

Market basket analysis is an interesting topic, though in my limited experience, it really falls apart when you have a large number of products to compare, so it tends to work better with toy examples or limited product selections because when you have a 50,000+ SKU inventory, the lift of any individual combination of products rarely gets above the level of noise.

Comments closed

Building Credit Scorecards

Andre Violante uses SAS to build credit scorecards and analyze credit data:

For this analysis I’m using the SAS Open Source library called SWAT (Scripting Wrapper for Analytics Transfer) to code in Python and execute SAS CAS Action Sets. SWAT acts as a bridge between the python language to CAS Action Sets. CAS Action Sets are synonymous to libraries in Python or packages in R. The one main difference and benefit is that the algorithms within these action sets have been highly parallelized to run on a CAS (Cloud Analytic Services) server. The CAS server is a distributed in-memory engine where I can do all my heavy lifting or computations. The code and Jupyter Notebook are available on GitHub.

Click through for the analysis.

Comments closed

P-Hacking and Multiple Comparison Bias

Patrick David has a great article on hypothesis testing, p-hacking, and multiple comparison bias:

The most important part of hypothesis testing is being clear what question we are trying to answer. In our case we are asking:
“Could the most extreme value happen by chance?”
The most extreme value we define as the greatest absolute AMVR deviation from the mean. This question forms our null hypothesis.

Give this one a careful read and try out the code. This is an important topic for anyone who analyzes data to understand.

Comments closed