Category: Data Science

Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large, high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy, and unlike clean numbers or categorical data in rows and columns, documents can be hard to tell apart. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, comparing all string pairs is the simplest approach, but it is neither efficient nor sufficient. Hashing documents with MD5 or SHA-1 gets us a correct outcome much faster, yet near-duplicates would still fly under the radar. Text similarity is useful for finding files that look alike. There are various approaches to this, and each has its own way of defining which documents count as duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing required and the results produced. Below are some of the options.

Click through for solutions in SAS.
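As a rough illustration of the two ideas (in Python rather than SAS, with a shingle size and similarity threshold I picked for the example, not anything from the post): MD5 hashes catch exact duplicates, while Jaccard similarity over character shingles surfaces near-duplicates.

```python
import hashlib
from itertools import combinations

def md5_fingerprint(text: str) -> str:
    """Exact-duplicate fingerprint: identical text -> identical hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 5) -> set:
    """Character k-shingles: overlapping substrings of length k."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

docs = {
    "d1": "The quick brown fox jumps over the lazy dog.",
    "d2": "The quick brown fox jumped over the lazy dog.",
    "d3": "An entirely different sentence about data science.",
}

# Exact duplicates: group documents by hash.
by_hash = {}
for name, text in docs.items():
    by_hash.setdefault(md5_fingerprint(text), []).append(name)

# Near-duplicates: pairwise Jaccard over shingle sets. This is O(n^2);
# for a large corpus you would use MinHash/LSH instead of all pairs.
sets = {name: shingles(text) for name, text in docs.items()}
for (n1, s1), (n2, s2) in combinations(sets.items(), 2):
    sim = jaccard(s1, s2)
    if sim > 0.7:  # illustrative threshold
        print(f"{n1} ~ {n2}: Jaccard = {sim:.2f}")
```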

The Basics of Automating Data Cleaning

Vincent Granville provides some guidance:

To the junior data scientist, it looks like each new dataset comes with a new set of challenges. It seems that you cannot automate data cleaning. To decision makers and stakeholders, this problem is so remote that they don't even know how many resources are wasted on it. To them, it seems obvious that automation is the way to go, but they may underestimate the challenges. It is usually not a high priority in many organizations, despite how much money it costs.

Yet there are at most a few dozen issues that come with data cleaning. Not a few thousand, not a few hundred. You can catalog them and address all of them at once with a piece of code, one that you can reuse each time you face a new data set. I describe here the main issues and how to address them. Automating the data cleaning step can save you a lot of time and eliminate boring, repetitive tasks, making your data scientists happier.

Click through for Vincent’s thoughts and recommendations.
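As a minimal sketch of that reusable-code idea (the specific checks below are illustrative picks of mine, not Vincent's catalog), a pandas-based cleaning pass might look like:

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """One reusable cleaning pass covering a few common issues."""
    df = df.copy()
    # Normalize column names.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Trim whitespace in text columns and map empty-ish values to NaN.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().replace({"": np.nan, "N/A": np.nan})
    # Drop exact duplicate rows.
    df = df.drop_duplicates()
    # Coerce columns that are entirely numeric-looking text.
    for col in df.select_dtypes(include="object").columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        if converted.notna().all():
            df[col] = converted
    return df

raw = pd.DataFrame({" Amount ": ["10", " 20 ", "30", "10"],
                    "City": ["NYC ", "LA", "N/A", "NYC"]})
print(clean(raw))
```

The point is the shape of the thing: one function, applied identically to every new data set, rather than ad hoc fixes per project.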

Survival Analysis Model Explanations with survex

Mikolaj Spytek promotes an R package:

You can learn about it in this blog, but long story short, survival models (most often) predict a survival function. It tells us the probability of an event not happening up to a given time t. The output can also be a single value (e.g., a risk score), but such scores are always aggregates of the survival function, which naturally loses some of the information contained in the prediction.

The complexity of the output of survival models means that standard explanation methods cannot be applied directly.

Because of this, we (I and the team: Mateusz Krzyziński, Hubert Baniecki, and Przemysław Biecek) developed an R package, survex, which provides explanations for survival models. We hope this tool allows for more widespread usage of complex machine learning survival analysis models. Until now, simpler statistical models such as Cox Proportional Hazards were preferred due to their interpretability, which is vital in areas such as medicine, even though they were frequently outperformed by complex machine learning models.

Read on to dive into the topic. H/T R-Bloggers.
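To make the survival-function idea concrete, here is a tiny Kaplan-Meier estimate of S(t) = P(T > t) in plain Python on synthetic data. This illustrates the concept only; it is not survex's API.

```python
import numpy as np

# Synthetic right-censored data: observed time and event indicator
# (1 = event happened, 0 = censored before the event was observed).
times = np.array([2, 3, 3, 5, 6, 7, 9, 10])
event = np.array([1, 1, 0, 1, 0, 1, 1, 0])

# Kaplan-Meier: S(t) is the product over event times t_i <= t of
# (1 - d_i / n_i), where d_i = events at t_i and n_i = subjects
# still at risk just before t_i.
surv = 1.0
print("t\tS(t)")
for t in np.unique(times[event == 1]):
    n_at_risk = np.sum(times >= t)
    d = np.sum((times == t) & (event == 1))
    surv *= 1 - d / n_at_risk
    print(f"{t}\t{surv:.3f}")
```

The whole curve is the prediction; collapsing it to one risk score is exactly the information loss the authors describe.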

Network Analysis in R via netUtils

David Schoch has an R package for us:

During the last five years, I have accumulated various scripts with (personal) convenience functions for network analysis, and I have also implemented new methods from time to time which I could not find in any other package in R. The package netUtils gathers all these functions and makes them available for anyone who may also need to apply "non-standard" network analytic tools. In this post, I will briefly highlight some of the most prominent functions of the package. All available functions are listed in the README on GitHub.

Click through to see what’s available in the package. H/T R-Bloggers.

What’s in a Name?

Benjamin Smith analyzes a name change:

Recently, RStudio announced its name change to Posit. For many, this name change was accepted with open arms, but for some, not so much. Being the statistician that I am, I decided to post a poll on LinkedIn to gauge the sentiment of my network. After running the poll for a week, the results were in:

Read on for the responses as well as an analysis using RStan.

Anomaly Detection over Delta Live Tables

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an 'anomaly' looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns in data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it's impossible to label a data set for training a machine learning model, even if the resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.

Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways the real problems have only begun. What is the best way to put this model into production so that each observation is ingested, transformed, and finally scored with the model as soon as the data arrives from the source system, in near real time or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. This end-to-end pipeline also has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and Delta Lake.
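As a minimal sketch of the unsupervised modeling piece only (scikit-learn's IsolationForest on synthetic data, leaving out the streaming pipeline the authors build):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Mostly "usual" observations around the origin, plus a few outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([normal, outliers])

# Unsupervised: no labels; the model learns what "usual" looks like.
model = IsolationForest(contamination=0.02, random_state=42).fit(X)
scores = model.predict(X)  # 1 = usual, -1 = anomalous
print("flagged as anomalous:", np.sum(scores == -1))

# In a streaming setting, a fitted model like this would score each
# micro-batch as it arrives rather than one static array.
```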

Understanding Decision Trees

Durgesh Gupta provides a primer on the humble decision tree:

A decision tree is a graphical representation of all possible solutions to a decision.

The objective of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.

It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.

The way I like to describe decision trees, especially to developers, is that a tree is a set of if-else statements leading to a conclusion. The nice part about decision trees is that once you understand how they work, you're halfway to understanding gradient boosting (e.g., XGBoost) and random forests. A short sketch of the if-else framing follows.
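Here is a tiny scikit-learn tree printed as its underlying rules, which makes the if-else structure explicit:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]

# A shallow tree keeps the rule set readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as nested if-else style rules:
# internal nodes are feature tests, leaves are predicted classes.
print(export_text(tree, feature_names=feature_names))
```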

Worrying over Columns or Rows

John Mount explains an attitude difference:

I say: if you are a data scientist or working on an analytics project, worry over columns, not rows.

In analytics “rows” are instances, and “columns” are possible measurements. For example: each click on a website might generate a row recording the visit, and this row would be populated with columns describing what was clicked on (and if you are lucky there are more records recording what else was presented and not clicked on).

Read the whole thing. This is also why formats like Parquet and ORC are so popular for data analysis. The same goes for business intelligence people, who reason mostly over columns, which is why columnstore indexes are so useful.
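As a quick sketch of why column orientation pays off (the file name and columns are made up, and pandas needs pyarrow or fastparquet installed to write Parquet):

```python
import pandas as pd

# Row thinking: read everything, then pick what you need.
# Column thinking: Parquet lets you read only the measurements
# you want, skipping the rest of the file entirely.
df = pd.DataFrame({
    "visit_id": range(1_000),
    "clicked_item": ["item_a"] * 1_000,
    "user_agent": ["Mozilla/5.0"] * 1_000,  # wide, rarely-needed column
})
df.to_parquet("clicks.parquet")

# Only the column the analysis needs is deserialized from disk.
clicks = pd.read_parquet("clicks.parquet", columns=["clicked_item"])
print(clicks.head())
```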

Extracting Numbers from a Stacked Density Plot

Derek Jones digs into an image:

A month or so ago, I found a graph showing the percentage of PCs having a given range of memory installed, between March 2000 and April 2020, on a TechTalk page of PC Matic; it had the form of a stacked density plot. This kind of installed-memory data is rare; how could I get the underlying values (a previous post covers extracting data from a heatmap)?

Read on for an interesting attempt at reverse-engineering the original numbers used to create an image. H/T R-Bloggers.
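As a rough sketch of the pixel-counting idea behind this kind of reverse-engineering (using Pillow, with a hypothetical image file and colour key, and ignoring anti-aliasing, which makes exact colour matching fragile in practice):

```python
from PIL import Image

# Hypothetical: the chart image and the fill colour of each memory band.
img = Image.open("stacked_density.png").convert("RGB")
band_colors = {"2GB": (31, 119, 180), "4GB": (255, 127, 14)}  # assumed

width, height = img.size
x = width // 2  # one vertical slice of pixels = one date on the x-axis

# Count pixels of each band's colour in this column; in a stacked
# percentage plot, band heights are proportional to the percentages.
counts = {name: 0 for name in band_colors}
for y in range(height):
    pixel = img.getpixel((x, y))
    for name, color in band_colors.items():
        if pixel == color:
            counts[name] += 1

total = sum(counts.values()) or 1
for name, n in counts.items():
    print(f"{name}: {100 * n / total:.1f}%")
```

Repeating this over every x column recovers a time series for each band.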

Mapping Income vs Rent in Counties

Rick Pack updates a package to support a project:

I am happy to announce a contribution to the biscale package that makes printing shorter labels using SI prefixes (e.g., 1,000,003 => 1M and 1,324 => 1.3k) far easier. This makes printing an attractive legend easier, although you can tell from the picture above that I still struggle with optimal uses of the cowplot package's draw_plot(). I would love for the legend and map to be centered under the title.

The new si_levels argument for bi_class_breaks() takes a logical value of TRUE or FALSE, as either a single-element or two-element vector; a single-element vector causes the specified value to be applied to both the X and Y variables. This matches Prener's convenient functionality for the number-of-digits argument dig_lab, as he requested in the GitHub issue I created for this addition. Note that si_levels rounds the input number, if appropriate, based on the digits indicated by dig_lab, which defaults to 3.

Click through to get access to the update, as well as to see some of the visuals Rick put together with it.
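The SI-prefix shortening itself is easy to sketch outside of R; a hypothetical Python equivalent (my own illustration, not biscale's implementation) could be:

```python
def si_format(x: float, digits: int = 3) -> str:
    """Shorten a number with an SI prefix, keeping `digits`
    significant digits: 1_000_003 -> '1M', 1_324 -> '1.32k'."""
    for threshold, suffix in [(1e9, "G"), (1e6, "M"), (1e3, "k")]:
        if abs(x) >= threshold:
            # %g keeps significant digits and drops trailing zeros,
            # so 1.000003 prints as "1", not "1.00".
            return f"{x / threshold:.{digits}g}{suffix}"
    return f"{x:.{digits}g}"

print(si_format(1_000_003))  # -> 1M
print(si_format(1_324))      # -> 1.32k
```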
