
Category: Data Science

Team Data Science Process Updates

David Smith announces updates to the Team Data Science Process:

It’s been over a year since we first introduced the Team Data Science Process (TDSP). The data, technology and practices behind Data Science continue to evolve, and the TDSP has evolved in parallel. Over the past year, several new facets have been added, including:

  • The IDEAR (Interactive Data Exploration, Analysis and Reporting) framework, an open source extension to R and Python designed to standardize the process of data exploration and reporting;

  • Guidance for use of Spark 2.0, including an end-to-end Spark v2.0 walkthrough;

  • Guidance for use of in-database Python with SQL Server, including an end-to-end in-database Python tutorial;

Click through for more changes, as well as links to further resources.


Text Preprocessing With R

Sibanjan Das has started a new series on text mining in R:

Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.

The series starts off with a quick description of some preprocessing steps and then builds an LDA model to extract key terms from articles.
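The series itself is in R. As a rough Python sketch of the same kind of pipeline (case folding, stop word removal, and a small LDA model), the following uses made-up documents and arbitrary parameter choices, just to make the steps concrete:

```python
# A minimal sketch of text preprocessing plus LDA-based key term extraction.
# The documents here are hypothetical; the linked series does this in R.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "The stock market rallied as tech earnings beat expectations.",
    "New research shows exercise improves memory and mood.",
    "Tech companies report record earnings despite market volatility.",
    "Doctors recommend regular exercise for long-term heart health.",
]

# Case folding and stop word removal happen inside the vectorizer;
# stemming or lemmatization could be added via a custom tokenizer.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Fit a small LDA model and print the top terms per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```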


Kaggle Data Science Report For 2017

Mark McDonald rounds up a few notebooks covering a recent Kaggle survey:

In 2017 we conducted our first ever extra-large, industry-wide survey to capture the state of data science and machine learning.

As the data science field has boomed, so has our community. In 2017 we hit a new milestone of reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what’s your favorite data science podcast?” to “what barriers are faced at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 survey responses were submitted, with over 6 full months of aggregated time spent completing it (an average response time of more than 16 minutes).

Click through for a few reports.  Something interesting to me is that the top languages/tools were, in order, Python, R, and SQL.  For the particular market niche that Kaggle competitions fit, that makes a lot of sense:  I tend to like R more for data exploration and data cleansing, but much of that work is already done by the time you get the dataset.


Dealing With Word Tensors

Chris Moody continues his series on natural language processing:

Counting and tensor decompositions are elegant and straightforward techniques. But these methods are grossly underrepresented in business contexts. In this post we factorized an example made up of word skipgrams occurring within documents to arrive at word and document vectors simultaneously. This kind of analysis is effective, simple, and yields powerful concepts.

Look to your own data, and before throwing black-box deep learning machines at them, try out tensor factorizations!

He has a set of animated GIFs to help with learning, though I do wish they were about 30% slower so you can take a moment to read each section before it jumps to the next bit.
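For a sense of what that factorization looks like in practice, here is a bare-bones sketch with made-up documents: it counts word-word-document skipgrams into a three-way array and then uses an unfolding plus SVD as a stand-in for a proper tensor decomposition, yielding word and document vectors.

```python
# Sketch: count word-word-document skipgrams into a three-way array, then factor it.
# The toy documents are made up, and a single unfolding + SVD stands in for a
# proper tensor decomposition (HOSVD, CP, etc.).
import numpy as np

docs = [
    "cats chase mice at night".split(),
    "dogs chase cats in yards".split(),
    "mice hide from cats at night".split(),
]

vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

# counts[i, j, d] = how often words i and j co-occur within a small window in doc d
window = 2
counts = np.zeros((len(vocab), len(vocab), len(docs)))
for d, doc in enumerate(docs):
    for pos, w in enumerate(doc):
        for other in doc[pos + 1 : pos + 1 + window]:
            counts[w2i[w], w2i[other], d] += 1
            counts[w2i[other], w2i[w], d] += 1

# Mode-1 unfolding: one row per word, columns are (context word, document) pairs.
word_unfolded = counts.reshape(len(vocab), -1)
U, S, _ = np.linalg.svd(word_unfolded, full_matrices=False)
word_vectors = U[:, :3] * S[:3]        # a 3-dimensional vector per word

# Mode-3 unfolding: one row per document, columns are (word, word) pairs.
doc_unfolded = counts.transpose(2, 0, 1).reshape(len(docs), -1)
Ud, Sd, _ = np.linalg.svd(doc_unfolded, full_matrices=False)
doc_vectors = Ud[:, :2] * Sd[:2]       # a 2-dimensional vector per document

print(dict(zip(vocab, np.round(word_vectors, 2))))
print(np.round(doc_vectors, 2))
```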


Data Set Robustness

Tomaz Kastrun shows how robust the iris data set is:

In conclusion, the IRIS dataset is – due to the nature of the measurements and observations – robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% for training the model is, for this particular case, just additional overload.

The general concept here is, how small can you arbitrarily slice the data and still come up with the same result as the overall data set?  Or, phrased differently, how much data do you need to collect before predictions stabilize?  Read on to see how Tomaz solves the problem.
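To make the question concrete, here is a quick scikit-learn sketch (Tomaz works in R, and the classifier choice here is arbitrary) that re-fits a model on progressively larger training fractions of iris and reports test accuracy:

```python
# Sketch: how does test accuracy on iris change as the training fraction grows?
# This is a rough scikit-learn equivalent of the idea, with an arbitrary model.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

for train_frac in (0.1, 0.2, 0.3, 0.5, 0.7):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"train fraction {train_frac:.0%}: accuracy {acc:.3f}")
```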


Online Learning Algorithms

Xin Hunt describes the benefits of online learning algorithms:

A few examples of classical online learning algorithms include recursive least squares, stochastic gradient descent and multi-armed bandit algorithms like Thompson sampling. Many online algorithms (including recursive least squares and stochastic gradient descent) have offline versions. These online algorithms are usually developed after the offline version, and are designed for better scaling with large datasets and streaming data. Algorithms like Thompson sampling, on the other hand, do not have offline counterparts, because the problems they solve are inherently online.

Let’s look at interactive ad recommendation systems as an example. You’ll find ads powered by these systems when you browse popular publications, weather sites and social media networks. These recommendation systems build customer preference models by tracking your shopping and browsing activities (ad clicking, wish list updates and purchases, for example). Due to the transient nature of shopping behaviors, new recommendations must reflect the most recent activities. This makes online learning a natural choice for these systems.

My favorite online learning algorithm at the moment is the Online Passive-Aggressive family of algorithms.  Not just because that name describes my Twitter feed.
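To see why a method like Thompson sampling is inherently online, here is a tiny Bernoulli-bandit sketch with made-up click-through rates, where every single ad impression immediately updates the model:

```python
# Sketch: Thompson sampling for ad selection with made-up click-through rates.
# Each impression updates a Beta posterior immediately; there is no offline fit.
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.03, 0.05, 0.08]           # hypothetical click rates for three ads
alpha = np.ones(len(true_ctr))          # Beta prior: successes
beta = np.ones(len(true_ctr))           # Beta prior: failures

clicks = np.zeros(len(true_ctr))
shows = np.zeros(len(true_ctr))

for _ in range(10_000):
    # Sample a plausible CTR for each ad from its posterior, show the best one.
    sampled = rng.beta(alpha, beta)
    ad = int(np.argmax(sampled))
    clicked = rng.random() < true_ctr[ad]

    # Online update: fold this single observation into the posterior right away.
    alpha[ad] += clicked
    beta[ad] += 1 - clicked
    shows[ad] += 1
    clicks[ad] += clicked

print("impressions per ad:", shows)
print("observed CTRs:", np.round(clicks / np.maximum(shows, 1), 3))
```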


Installing The Azure ML Workbench

Leila Etaati walks us through setting up the Azure ML workbench:

At Microsoft Ignite 2017, the Azure ML team announced new on-premises tools for doing machine learning. These tools are much more comprehensive, as they provide:

1. a workspace that helps with data wrangling

2. data visualization

3. easy deployment

4. support for Python code

In this post and the next posts, I will share my experience working with these tools.

Click through for the step-by-step installation guide.


Unsupervised Decision Trees

William Vorhies describes what unsupervised decision trees are:

In anomaly detection we are attempting to identify items or events that don’t match the expected pattern in the data set and are by definition rare.  The traditional ‘signature-based’ approach widely used in intrusion detection systems creates training data that can be used in normal supervised techniques.  When an attack is detected, the associated traffic pattern is recorded, marked, and classified as an intrusion by humans.  That data is then combined with normal data to create the supervised training set.

In both supervised and unsupervised cases, decision trees, now in the form of random forests, are the weapon of choice.  Decision trees are nonparametric; they don’t make an assumption about the distribution of the data.  They’re great at combining numeric and categorical variables, and they handle missing data like a champ.  All types of anomaly data tend to be high-dimensional, and decision trees can take it all in and offer a reasonably clear guide for pruning back to just what’s important.

To be complete, there is also a category of semi-supervised anomaly detection in which the training data consists only of normal transactions, without any anomalies.  This is also known as ‘One-Class Classification’ and uses one-class SVMs or autoencoders in a slightly different way not discussed here.

Interesting reading.  I’d had no idea that unsupervised decision trees were even a thing.
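The post itself is conceptual. One readily available realization of unsupervised, tree-based anomaly detection is scikit-learn’s IsolationForest; the data below is synthetic and only meant to show the shape of the approach:

```python
# Sketch: unsupervised tree-based anomaly detection with an isolation forest.
# No labels are used for training; the forest isolates points that are "easy"
# to separate from the bulk of the data and flags them as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))     # synthetic baseline traffic
anomalies = rng.normal(loc=6.0, scale=1.0, size=(10, 4))   # synthetic rare outliers
X = np.vstack([normal, anomalies])

forest = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
labels = forest.fit_predict(X)          # +1 = inlier, -1 = anomaly
scores = forest.decision_function(X)    # lower scores = more anomalous

print("points flagged as anomalies:", int((labels == -1).sum()))
print("most anomalous score:", round(float(scores.min()), 3))
```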


Stop Using word2vec

Chris Moody wants you to stop using word2vec:

When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries. Word vectors are awesome but you don’t need a neural network – and definitely don’t need deep learning – to find them. So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then stop using word2vec.
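The count-based alternative he describes can be sketched in plain Python and SciPy: count skipgrams, form a PPMI matrix, and take a truncated SVD. The toy corpus and parameter choices below are illustrative, not Chris’s exact pipeline:

```python
# Sketch: count-based word vectors (skipgram counts -> PPMI -> truncated SVD).
# Toy corpus is hypothetical; the point is that no neural network is involved.
import numpy as np
from collections import Counter
from scipy.sparse.linalg import svds

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

vocab = sorted({w for s in sentences for w in s})
w2i = {w: i for i, w in enumerate(vocab)}

# Count skipgrams within a small window.
window = 2
pairs = Counter()
for s in sentences:
    for i, w in enumerate(s):
        for c in s[max(0, i - window): i] + s[i + 1: i + 1 + window]:
            pairs[(w2i[w], w2i[c])] += 1

counts = np.zeros((len(vocab), len(vocab)))
for (i, j), n in pairs.items():
    counts[i, j] = n

# Positive pointwise mutual information.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts * total) / (row @ col))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD: rows of U * S are the word vectors.
k = 5
U, S, _ = svds(ppmi, k=k)
word_vectors = U * S
print(vocab[0], np.round(word_vectors[0], 2))
```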

Chris has a follow-up post on word tensors as well:

There are only three steps to computing word tensors: counting word-word-document skipgrams, normalizing those counts to form the PMI-like M tensor, and then factorizing M into smaller matrices.

But to actually perform the factorization we’ll need to generalize the SVD to higher-rank tensors. Unfortunately, tensor algebra libraries aren’t very common. We’ve written one for non-negative sparse tensor factorization, but because the PMI can be both positive and negative it isn’t applicable here. Instead, for this application I’d recommend HOSVD as implemented in scikit-tensor. I’ve also heard good things about tensorly.

I’m going to keep using word2vec for now, but it’s a good pair of posts.


Unintentional Data

Eric Hollingsworth describes what data science looks like as the cost of collecting data approaches zero:

Thankfully not only have modern data analysis tools made data collection cheap and easy, they have made the process of exploratory data analysis cheaper and easier as well. Yet when we use these tools to explore data and look for anomalies or interesting features, we are implicitly formulating and testing hypotheses after we have observed the outcomes. The ease with which we are now able to collect and explore data makes it very difficult to put into practice even basic concepts of data analysis that we have learned — things such as:

  • Correlation does not imply causation.
  • When we segment our data into subpopulations by characteristics of interest, members are not randomly assigned (rather, they are chosen deliberately) and suffer from selection bias.
  • We must correct for multiple hypothesis tests.
  • We ought not dredge our data.

All of those principles are well known to statisticians, and have been so for many decades. What is newer is just how cheap it is to posit hypotheses. For better and for worse, technology has led to a democratization of data within organizations. More people than ever are using statistical analysis packages and dashboards, explicitly or more often implicitly, to develop and test hypotheses.

This is a thoughtful essay well worth reading.
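As a small illustration of the multiple-testing point in that list, simulating many comparisons on pure noise shows how easily “significant” findings appear without a correction (synthetic data, with Bonferroni as the simplest possible adjustment):

```python
# Sketch: dredging pure noise produces "significant" results unless we correct
# for the number of hypotheses tested. Data here is entirely synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group = 200, 50
alpha = 0.05

p_values = []
for _ in range(n_tests):
    # Two groups drawn from the same distribution: the null is true every time.
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print("uncorrected 'discoveries':", int((p_values < alpha).sum()))
print("Bonferroni-corrected     :", int((p_values < alpha / n_tests).sum()))
```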
