Press "Enter" to skip to content

Category: Data Science

Image Clustering With Keras And R

Shirin Glander shows us how to use R to extract learned features from Keras and cluster those features:

For each of these images, I am running the predict() function of Keras with the VGG16 model. Because I excluded the last layers of the model, this function will not actually return any class predictions as it would normally do; instead we will get the output of the last layer: block5_pool (MaxPooling2D).

These, we can use as learned features (or abstractions) of the images. Running this part of the code takes several minutes, so I save the output to an RData file (because I sampled randomly, the classes you see below might not be the same as in the sample_fruits list above).
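The gist of that pipeline fits in a few lines of R. Here is a condensed sketch (not Shirin's exact script) using the keras package; image_paths stands in for whatever vector of image files you have, and the choice of 10 clusters is arbitrary:

```r
library(keras)

# VGG16 without the classification head, so predict() returns the output
# of the last pooling layer (block5_pool) instead of class probabilities
model <- application_vgg16(weights = "imagenet", include_top = FALSE)

extract_features <- function(path) {
  img <- image_load(path, target_size = c(224, 224))
  x <- image_to_array(img)
  x <- array_reshape(x, c(1, dim(x)))
  x <- imagenet_preprocess_input(x)
  predict(model, x)                 # shape: 1 x 7 x 7 x 512
}

# image_paths is a placeholder for your character vector of image files
features <- t(sapply(image_paths, function(p) as.numeric(extract_features(p))))

# cluster the learned features; 10 centers is an arbitrary choice here
clusters <- kmeans(features, centers = 10)
table(clusters$cluster)
```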

Read the whole thing.


Using plm To Analyze Panel Data

Michael Grogan shows us how to use the plm package to perform linear regression against panel data:

Types of data

  • Cross-Sectional: Data collected at one particular point in time
  • Time Series: Data collected across several time periods
  • Panel Data: A mixture of both cross-sectional and time series data, i.e. collected at a particular point in time and across several time periods
  • Fixed Effects: Effects that are independent of random disturbances, e.g. observations independent of time.
  • Random Effects: Effects that include random disturbances.

Let us see how we can use the plm library in R to account for fixed and random effects. There is a video tutorial link at the end of the post.
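As a quick taste of the syntax before you click through, here is a minimal sketch (not Michael's example) using the Grunfeld data that ships with plm, modeling firm investment on firm value and capital stock:

```r
library(plm)
data("Grunfeld", package = "plm")

# Fixed effects ("within" estimator): firm-specific intercepts
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")

# Random effects: firm effects treated as random disturbances
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")

summary(fe)

# Hausman test to help choose between the two specifications
phtest(fe, re)
```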

Read on for an example.


Multi-Class Classification With vtreat

John Mount has an example of using the vtreat package for multi-class classification in R:

vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).

In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.

The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.
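To give a sense of the workflow, here is a bare-bones sketch on a made-up three-class data frame; the column names and data are purely illustrative, and the result components (cross_frame, score_frame, treat_m) follow my reading of the package documentation:

```r
library(vtreat)

set.seed(2018)
d <- data.frame(
  x1 = rnorm(300),
  x2 = sample(c("a", "b", "c", NA), 300, replace = TRUE),
  y  = sample(c("class1", "class2", "class3"), 300, replace = TRUE),
  stringsAsFactors = FALSE
)

# build the multi-class cross-frame experiment on the training data
cfe <- mkCrossFrameMExperiment(d, c("x1", "x2"), "y")

train_treated <- cfe$cross_frame    # cross-validated treated training frame
score_frame   <- cfe$score_frame    # per-variable quality summaries

# new data goes through prepare() with the stored treatment plan,
# which dispatches to the prepare.multinomial_plan() method
new_treated <- prepare(cfe$treat_m, d)
```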

Click through for an example of this in action.


Distance Between Strings: Levenshtein Distance

Nikhil Babar has an introduction to the Levenshtein distance algorithm:

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who introduced this distance in 1965.

Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.

Read on for an explanation and an example. Levenshtein distance is a great way of calculating string similarity, and it can help with tasks like data cleansing: finding typos or alternate spellings, or matching parts of street addresses.
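If you just want to play with the idea, base R already ships with adist(), which computes Levenshtein distances; here is a tiny, made-up fuzzy-matching example:

```r
# three edits turn "kitten" into "sitting": k -> s, e -> i, append g
adist("kitten", "sitting")

# toy data-cleansing example with invented misspellings
dirty <- c("Sealte", "Porland", "Chicgo")
clean <- c("Seattle", "Portland", "Chicago", "Houston")

d <- adist(dirty, clean)               # matrix of edit distances
clean[apply(d, 1, which.min)]          # nearest clean spelling for each value
```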


Scaling Anomaly Detection With Kafka And Cassandra

Paul Brebner has started a series on anomaly detection using Kafka and Cassandra, starting with an introduction:

Let’s look at the application domain in more detail. In the previous blog series on Kongo, a Kafka-focused IoT logistics application, we persisted business “violations” to Cassandra for future use using Kafka Connect. For example, we could have used the data in Cassandra to check and certify that a delivery was free of violations across its complete storage and transportation chain.

An appropriate scenario for a Platform application involving Kafka and Cassandra has the following characteristics:

  1. Large volumes of streaming data are ingested into Kafka (at variable rates)

  2. Data is sent to Cassandra for long-term persistence

  3. Streams processing is triggered by the incoming events in real-time

  4. Historic data is requested from Cassandra

  5. Historic data is retrieved from Cassandra

  6. Historic data is processed, and

  7. A result is produced.

It looks like he’s focusing on changepoint detection, which is one of several good techniques for generalized anomaly detection.  I’ll be interested in following this series.
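Paul's series is built on Kafka and Cassandra rather than R, but if you want to see what changepoint detection looks like in miniature, the changepoint package makes it a few lines on simulated data (my sketch, not anything from the series):

```r
library(changepoint)

set.seed(42)
# a metric whose mean shifts partway through the series
x <- c(rnorm(200, mean = 10), rnorm(200, mean = 14))

fit <- cpt.mean(x, method = "PELT")
cpts(fit)    # estimated changepoint location(s), ideally near observation 200
plot(fit)    # series with the fitted segment means overlaid
```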


Hadoop + SQL Server In 2019

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.


Wasting Money With Data Science

Giovanni Lanzani has a post with the controversial title above:

Some data is gathered, given to data scientists, and — after two weeks — the first demo takes place. The results are promising, but they need a bit more time.

Fine. After all, the data was messy: they had to clean it up and go back to the source a couple of times.

Two weeks pass and the new results are even nicer. With 70% accuracy, they can predict if a patient will go home after their visit to the emergency room.

This is much better than random (50%)! A full-fledged pilot starts.

They are faced with a couple of challenges to go from model to data product:

  • How to send the source data to the model is unclear;

  • Where the model should run;

  • The hospital operations need to change, as the intake happens with pen and paper;

  • They realize that without knowing to which department the patient will go, they won’t add any value;

  • To predict the department, the model needs the diagnosis. But once the diagnosis gets typed into the computer, the patient has reached their destination: the model is useless!

I think it’s a fair point: it’s easy for internal researchers to look for things which they can do, but which don’t have much business value. The risk on the other side is that you’ll start diving into a high-potential-value problem and then realize that the data isn’t there to draw conclusions or that the relationships you expected simply aren’t there.


Be Careful Of P-Hacking

Vincent Granville discusses the problem of p-hacking:

I read an article this morning, about a top Cornell food researcher having 13 studies retracted, see here. It prompted me to write this blog. It is about data science charlatans and unethical researchers in academia, destroying the value of p-values again, using a well-known trick called p-hacking, to get published in top journals and get grant money or tenure. The issue is widespread, not just in academic circles, and makes people question the validity of scientific methods. It fuels the fake “theories” of those who have lost faith in science.

The trick consists of repeating an experiment sufficiently many times, until the conclusions fit with your agenda. Or by cherry-picking the data you use, or even discarding observations deemed to have a negative impact on the conclusions. Sometimes, causation and correlation are mixed up on purpose, or misleading charts are displayed. Sometimes, the author lacks statistical acumen.

Usually, these experiments are not reproducible. Even top journals sometimes accept these articles, due to

  • Poor peer-review process

  • Incentives to publish sensational material
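A two-line simulation makes the mechanics of the "repeat until significant" trick concrete (my sketch, not Granville's): run twenty t-tests on pure noise and report only the best one.

```r
set.seed(123)

# twenty experiments where the null hypothesis is true by construction
p_values <- replicate(20, t.test(rnorm(30), rnorm(30))$p.value)

min(p_values)          # the "finding" a p-hacker would write up
mean(p_values < 0.05)  # how often noise alone clears the 5% bar
```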

Wansink is a charlatan.  But beyond p-hacking is Andrew Gelman and Eric Loken’s Garden of Forking Paths.  Gelman’s blog, incidentally (example), is where I originally learned about Wansink’s shady behaviors.  Gelman also warns us not to focus on the procedural, but instead on a deeper problem.


Calculating Lifetime Value With R

Sergey Bryl shows how to calculate the lifetime value of a subscription service:

Predicting LTV is a common issue for a new, recently launched product/service/application when we don’t have a lot of historical data but want to calculate LTV as soon as possible. Even though we may have a lot of historical data on customer payments for a product that is active for years, we can’t really trust earlier stats since the churn curve and LTV can differ significantly between new customers and the current ones due to a variety of reasons.

Therefore, regardless of whether our product is new or “old”, we attract new subscribers and want to estimate what revenue they will generate during their lifetimes for business decision-making.

This topic is closely connected to cohort analysis, and if you are not familiar with the concept, I recommend that you read about it and look at the other articles I wrote earlier on this blog.
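Sergey works through a proper cohort-based approach; as a much cruder point of reference, a back-of-the-envelope LTV calculation in R only needs an assumed retention curve and average revenue per user (all numbers below are invented):

```r
arpu <- 10                                    # revenue per active subscriber per month
retention <- c(1.00, 0.70, 0.55, 0.45, 0.40,  # share of a cohort still active
               0.36, 0.33, 0.31, 0.29, 0.28,  # in months 1 through 12
               0.27, 0.26)                    # (made-up numbers)

# discount monthly cash flows at roughly a 10% annual rate
discount <- 1 / (1 + 0.10 / 12) ^ (seq_along(retention) - 1)

ltv <- sum(arpu * retention * discount)
ltv   # expected discounted revenue per new subscriber over the first year
```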

Read the whole thing.


Interpreting The Area Under The Receiver Operating Characteristic Curve

Roos Colman explains what a Receiver Operating Characteristic (ROC) curve is and how we interpret the Area Under the Curve (AUC):

The AUC can be defined as “The probability that a randomly selected case will have a higher test result than a randomly selected control”. Let’s use this definition to calculate and visualize the estimated AUC.
In the figure below, the cases are presented on the left and the controls on the right.
Since we have only 12 patients, we can easily visualize all 32 possible combinations of one case and one control. (R code below)

Expanding from this easy-to-follow example, Colman walks us through some of the statistical tests involved.  Check it out.
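If you want to see that pairwise definition in action, here is a quick R sketch with invented test scores (not the 12 patients from the post): compare every case against every control and count how often the case wins, with ties counting half.

```r
cases    <- c(0.81, 0.62, 0.90, 0.74)               # test results for diseased patients
controls <- c(0.35, 0.58, 0.44, 0.70, 0.29, 0.51)   # test results for healthy patients

diffs <- outer(cases, controls, FUN = "-")          # every case/control pair
auc <- mean(diffs > 0) + 0.5 * mean(diffs == 0)     # ties count as half
auc

# cross-check against a standard implementation, e.g.:
# pROC::auc(response = c(rep(1, 4), rep(0, 6)), predictor = c(cases, controls))
```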
