
Category: Data Science

DataExplorer

Boxuan Cui introduces DataExplorer, an R package dedicated to assisting with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task, eating up roughly 80% of a data scientist’s time. DataExplorer is one of many resources that try to address this, with the sole mission of minimizing that 80% and making the work enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes, so you should be able to throw in any data.frame-like objects. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover later in the post.

For my money, that number is closer to 90%.  I will have to check this package out.
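
As a quick taste of the "one function call" philosophy, here is a minimal sketch against a built-in data set. The function names are ones I believe are part of DataExplorer's API; treat the exact arguments as my own assumptions:

    # Minimal sketch: exploratory data analysis with DataExplorer on a built-in data set
    library(DataExplorer)

    data(airquality)             # built-in data frame with some missing values

    introduce(airquality)        # row/column counts, missing values, memory usage
    plot_missing(airquality)     # share of missing values per column
    plot_histogram(airquality)   # histograms for every continuous column

    # Or generate a full HTML profiling report in a single call
    create_report(airquality)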


The Year Of The Data Engineer

Alex Woodie points out that data science also requires data engineers:

The shortage of data scientists – those triple-threat types who possess advanced statistics, business, and coding skills – has been well-documented over the years. But increasingly, businesses are facing a shortage of another key individual on the big data team who’s critical to achieving success – the data engineer.

Data engineers are experts in designing, building, and maintaining the data-based systems in support of an organization’s analytical and transactional operations. While they don’t boast the quantitative skills that a data scientist would use to, say, build a complex machine learning model, data engineers do much of the other work required to support that data science workload, such as:

  • Building data pipelines to collect data and move it into storage;

  • Preparing the data as part of an ETL or ELT process;

  • Stitching the data together with scripting languages;

  • Working with the DBA to construct data stores;

  • Ensuring the data is ready for use;

  • Using frameworks and microservices to serve data.

Read the whole thing.  My experience is that most shops looking to hire a data scientist really need to get data engineers first; otherwise, you’re wasting that high-priced data scientist’s time.  The plus side is that if you’re already a database developer, getting into data engineering is much easier than mastering statistics or neural networks.


ARIMA In R

Subhasree Chatterjee shows us how to use R to implement an ARIMA model:

Once the data is ready and satisfies all the assumptions of modeling, we need three variables to determine the order of the model to be fitted to the data: p, d, and q, which are non-negative integers referring to the order of the autoregressive, integrated, and moving average parts of the model, respectively.

To examine which p and q values will be appropriate, we need to run the acf() and pacf() functions.

pacf() at lag k is the partial autocorrelation function, which describes the correlation between all data points that are exactly k steps apart, after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregressive (AR) coefficients (the value of p) in an ARIMA model.

ARIMA feels like it should be too simple to work, but it does.
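
To make the p/d/q identification concrete, here is a minimal sketch in base R on a built-in time series; the specific orders passed to arima() are illustrative guesses rather than values derived from the plots:

    # Minimal sketch: identifying and fitting an ARIMA model in base R
    y <- Nile                      # built-in annual time series

    acf(y)                         # slow decay suggests differencing is needed (d = 1)
    y_diff <- diff(y)

    acf(y_diff)                    # where the ACF cuts off hints at q
    pacf(y_diff)                   # where the PACF cuts off hints at p

    # Fit a candidate model with the chosen (p, d, q) and forecast ahead
    fit <- arima(y, order = c(1, 1, 1))
    fit
    predict(fit, n.ahead = 5)      # forecast the next five observations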


Structural Topic Models In R

Julia Silge has a great post on building Structural Topic Models in R using stm and tidytext:

The stm package has a summary() method for trained topic models like these that will print out some details to your screen, but I want to get back to a tidy data frame so I can use dplyr and ggplot2 for data manipulation and data visualization. I can use tidy() on the output of an stm model, and then I will get the probabilities that each word is generated from each topic.

I haven’t watched the video yet, but that’s on my to-do list for today.
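
Since I can't reproduce Julia's corpus here, a minimal sketch of the same workflow using the janeaustenr novels as stand-in documents; the chunking into roughly 500-line sections and the choice of K = 6 are my own assumptions:

    # Minimal sketch of the stm + tidytext workflow described above
    library(tidyverse)
    library(tidytext)
    library(stm)
    library(janeaustenr)   # stand-in corpus for illustration

    # Treat ~500-line chunks of each novel as "documents"
    word_counts <- austen_books() %>%
      group_by(book) %>%
      mutate(section = row_number() %/% 500) %>%
      ungroup() %>%
      unite(document, book, section) %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(document, word, sort = TRUE)

    # Cast to a sparse document-feature matrix and fit the topic model
    austen_dfm <- word_counts %>%
      cast_dfm(document, word, n)

    topic_model <- stm(austen_dfm, K = 6, verbose = FALSE)

    # tidy() turns the per-topic word probabilities (beta) back into a data frame
    td_beta <- tidy(topic_model)

    # Top words per topic, plotted with ggplot2
    td_beta %>%
      group_by(topic) %>%
      slice_max(beta, n = 10) %>%
      ungroup() %>%
      ggplot(aes(beta, reorder_within(term, beta, topic))) +
      geom_col() +
      scale_y_reordered() +
      facet_wrap(~ topic, scales = "free_y") +
      labs(y = NULL)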


Exploring The MNIST Dataset

David Robinson performs exploratory data analysis on the MNIST digit database:

The challenge is to classify a handwritten digit based on a 28-by-28 black and white image. MNIST is often credited as one of the first datasets to prove the effectiveness of neural networks.

In a series of posts, I’ll be training classifiers to recognize digits from images, while using data exploration and visualization to build our intuitions about why each method works or doesn’t. Like most of my posts I’ll be analyzing the data through tidy principles, particularly using the dplyr, tidyr and ggplot2 packages. In this first post we’ll focus on exploratory data analysis, to show how you can better understand your data before you start training classification algorithms or measuring accuracy. This will help when we’re choosing a model or transforming our features.

Read on for the analysis.
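
To give a flavor of the tidy approach, here is a minimal sketch that reshapes pixel columns into long format and draws a few digits with ggplot2. The mnist_raw data frame is an assumption on my part: one row per image, a label column, and pixel columns named pixel0 through pixel783 (for example, loaded from a CSV export of MNIST with read_csv()).

    # Minimal sketch: tidy exploration of MNIST pixel data
    library(tidyverse)

    # Assumption: mnist_raw has columns label, pixel0, ..., pixel783 (one row per image)
    pixels_gathered <- mnist_raw %>%
      mutate(instance = row_number()) %>%
      pivot_longer(starts_with("pixel"), names_to = "pixel", values_to = "value") %>%
      mutate(pixel = parse_number(pixel),     # pixel index 0..783
             x = pixel %% 28,                 # column within the 28x28 grid
             y = 28 - pixel %/% 28)           # row, flipped so digits render upright

    # Render the first dozen digits as tile plots
    pixels_gathered %>%
      filter(instance <= 12) %>%
      ggplot(aes(x, y, fill = value)) +
      geom_tile() +
      scale_fill_gradient(low = "white", high = "black") +
      facet_wrap(~ instance + label)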


Markov Chains In Python

Sandipan Dey shows off various uses of Markov chains as well as how to create one in Python:

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and have formed the basis of a lot of text processing (especially text generation) due to the algorithm’s relatively low computational requirements.
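
The linked post works in Python; purely to illustrate the idea in the same language as the rest of this roundup, here is a minimal sketch of an order-1, word-level Markov chain text generator in base R (the sample sentence is just a placeholder):

    # Minimal sketch: order-1 word-level Markov chain text generation in base R
    set.seed(42)

    text  <- "the cat sat on the mat and the dog sat on the rug"
    words <- strsplit(text, " ")[[1]]

    # Transition table: for each word, the words observed to follow it
    transitions <- split(words[-1], words[-length(words)])

    generate <- function(start, n_words) {
      out <- start
      current <- start
      for (i in seq_len(n_words - 1)) {
        nxt <- transitions[[current]]
        if (is.null(nxt)) break          # dead end: no observed successor
        current <- sample(nxt, 1)
        out <- c(out, current)
      }
      paste(out, collapse = " ")
    }

    generate("the", 10)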


More DBA Salary Research

Ginger Grant digs into the DBA salary survey a bit further:

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers.  According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the impact of telecommuting on the top 5%?  Well, it depends on whether you are looking at the much smaller female population. The majority of females in the top 5% telecommute.  Those who telecommute 100% of the time do very well, as well as those who spend every day at a job site.  Males report working more hours and telecommuting less than females do as well.  If you look at people who are in the average category, they do not telecommute. The average category has 25% of people who work less than 40 hours a week too. If you look at the number of items in the category by country, you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary in those locations.

Another important consideration here is normalizing salaries for cost of living:  it’s a lot easier to get a $100K/year job in Manhattan, NY than in Manhattan, KS, but $100K goes much further in the latter.  Based on my limited digging into the data set, it’d be tough to draw any conclusions on that front, but it is an a priori factor that I’d want to consider when dealing with salary survey data.
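
A rough sketch of what that adjustment could look like, with a completely made-up cost-of-living index (the index values, metro names, and column names are all hypothetical):

    # Rough sketch: adjusting salaries by a hypothetical cost-of-living index
    library(dplyr)

    col_index <- tibble(
      metro = c("New York, NY", "Manhattan, KS", "Seattle, WA"),
      index = c(2.3, 0.9, 1.5)        # 1.0 = national average cost of living (made up)
    )

    salaries <- tibble(
      metro  = c("New York, NY", "Manhattan, KS", "Seattle, WA"),
      salary = c(100000, 100000, 110000)   # placeholder figures
    )

    salaries %>%
      inner_join(col_index, by = "metro") %>%
      mutate(adjusted_salary = salary / index) %>%
      arrange(desc(adjusted_salary))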


DBA Salary Calculations

Eugene Meidinger takes a whack at the data professional salary survey:

So I’m using something called a multiple linear regression to make a formula to predict your salary based on specific variables. Unfortunately, the highest coefficient of determination (or R²) I’ve been able to get is 0.37, which means, as far as I understand it, that at most the model explains 37% of the variation.

Additionally, the spread on the results isn’t great either. The standard deviation, a measure of spread, is about $25,000 on the original subset of data, which means we’d expect 68% to be within +/- $25,000 of the average and 95% to be within +/- $50,000 of the average. So what happens when we apply our model?

Read on for Eugene’s early findings and a roadmap for additional posts.
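
For anyone who wants to poke at this themselves, here is a minimal sketch of fitting a multiple linear regression in R and pulling out R² and the residual spread; the survey data frame and its column names are hypothetical stand-ins, not the actual survey fields:

    # Minimal sketch: multiple linear regression against (hypothetical) survey columns
    # Assumption: survey is a data frame with SalaryUSD plus a few numeric and
    # categorical predictor columns; these names are illustrative only
    fit <- lm(SalaryUSD ~ YearsOfExperience + HoursWorkedPerWeek + EmploymentSector,
              data = survey)

    summary(fit)$r.squared    # coefficient of determination (R²)
    sigma(fit)                # residual standard error, i.e. the spread around the fit

    predict(fit, newdata = head(survey))   # predicted salaries for a few rows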


Choose Your Own Regression Adventure

Jim Frost explains when you might use different types of regression analysis:

Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.

I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.

It’s a good overview of several techniques.
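
To make the "pick your regression by the dependent variable" idea concrete, here is a minimal sketch of the three most common cases in R, using built-in data sets as stand-ins for real projects:

    # Continuous dependent variable: ordinary least squares
    fit_ols <- lm(mpg ~ wt + hp, data = mtcars)

    # Binary categorical dependent variable: logistic regression
    fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

    # Count dependent variable: Poisson regression
    fit_pois <- glm(count ~ spray, data = InsectSprays, family = poisson)

    summary(fit_ols)
    summary(fit_logit)
    summary(fit_pois)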


Finding Maxima And Minima

Jobil Louis shares various techniques for finding a global maximum or minimum:

Let’s say we want to find the minimum point in y and the value of x which gives that minimum y. There are many ways to find this. I will explain three of those.

1) Search-based methods: Here the idea is to search for the minimum value of y by feeding in different values of x. There are two different ways to do this.

  • a) Grid search: In grid search, you give a list of values for x (as in a grid) and calculate y and see the minimum of those.

b) Random search: In this method, you randomly generate values of x and compute y and find the minimum among those.

The drawback of search-based methods is that there is no guarantee that we will find a local or global minimum. Global minimum means the overall minimum of a curve. Local minimum means a point which is minimum relative to its neighboring values.

My favorite class of algorithm here is evolutionary algorithms, particularly genetic algorithms and genetic programming.  They’re a last-ditch effort when nothing else works, but the funny thing about them is that when nothing else works, they tend to step up.
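
Here is a minimal sketch of the grid search and random search the excerpt describes, minimizing a simple one-dimensional function; the function and search range are purely illustrative:

    # Minimal sketch: grid search vs. random search for a one-dimensional minimum
    set.seed(123)
    f <- function(x) (x - 2)^2 + sin(5 * x)    # bumpy illustrative function

    # a) Grid search: evaluate f over a fixed grid of x values
    grid_x <- seq(-5, 5, by = 0.01)
    grid_y <- f(grid_x)
    c(x = grid_x[which.min(grid_y)], y = min(grid_y))

    # b) Random search: evaluate f at randomly drawn x values
    rand_x <- runif(1000, min = -5, max = 5)
    rand_y <- f(rand_x)
    c(x = rand_x[which.min(rand_y)], y = min(rand_y))

    # For comparison, base R's optimize(); like the searches above, it can
    # settle on a local rather than global minimum for a bumpy function
    optimize(f, interval = c(-5, 5))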
