Category: Data Science

The Process Of Processing Data

I continue my series on launching a data science project:

This next category of data cleansing has to do with specific values.  I want to look at three particular sub-categories:  mislabeled data, mismatched data, and incorrect data.

Mislabeled data happens when the label is incorrect.  In a data science problem, the label is the thing that we are trying to explain or predict.  For example, in our data set, we want to predict SalaryUSD based on various inputs.  If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, it can potentially affect our analysis.  If you can fix the label, this data becomes useful again, but if you cannot, it increases the error, which means we have a marginally lower capability for accurate prediction.

Mismatched data happens when we join together data from sources which should not have been joined together.  Let’s go back to the product title and UPC/MFC example.  As we fuss with the data to try to join together these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC.  We might be able to notice this with careful observation, but if we let it through, then we will once again thwart reality and introduce some additional error into our analysis.  We could also end up with the opposite problem, where we have missed connections and potentially drop useful data out of our sample.

Finally, I’m using incorrect data to describe cases where something other than the label is wrong.  For example, in the data professional salary survey, there’s a person who works 200 hours per week.  While I admire this person’s dedication and ability to create roughly 1.3 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year.  I mean, if I had the ability to generate spare days, I’d want way more than that.

In this series, I’ve found myself writing a bit more than expected, so I’m breaking out theory from implementation.  This is the theory post, with implementation coming next week.
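
In the meantime, here is a minimal R sketch of the kind of check the implementation post will cover in more detail: flag rows whose values fall outside plausible ranges, then decide whether to fix them or drop them.  The toy data frame, the HoursWorkedPerWeek column name, and the thresholds are assumptions for illustration rather than the survey’s actual schema.

    # Toy stand-in for the salary survey; column names are assumed for this example.
    salary_survey <- data.frame(
      SalaryUSD          = c(50000, 500000, 95000),
      HoursWorkedPerWeek = c(40, 45, 200)
    )

    # Arbitrary sanity thresholds; tune them to your own data.
    max_plausible_hours  <- 112     # 16 waking hours a day, 7 days a week
    max_plausible_salary <- 400000

    suspect <- subset(
      salary_survey,
      HoursWorkedPerWeek > max_plausible_hours | SalaryUSD > max_plausible_salary
    )

    # Review the flagged rows by hand: fix the value if you can, drop it if you cannot.
    print(suspect)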


Image Recognition Using Viola-Jones

Ellen Talbot lays out some of the basics of image recognition:

Aggregate channel features (ACF) is a variation of channel features, which extracts features directly as pixel values in extended channels without computing rectangular sums at various locations and scales.

Common channels include the colour channels, such as grey-scale and RGB, but many other channels can be encoded, depending on the difficulty of your problem (e.g. gradient magnitude and gradient histograms).

ACF has advantages, such as a richer representation, accelerated detection speed and more accurate localisation of objects in the images when used in conjunction with a boosting method.

Click through for more, including a few resources around the Viola-Jones algorithm.
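
To make the idea of channels a bit more concrete, here is a rough R sketch that computes two of the channels mentioned above, a grey-scale channel and a gradient magnitude channel, from a toy RGB array.  It illustrates the concept only; it is not the ACF implementation itself, which goes on to aggregate these channels over blocks and feed them to a boosted detector.

    # Toy 3-channel "image": a 10x10 array with R, G, and B planes.
    set.seed(42)
    img <- array(runif(10 * 10 * 3), dim = c(10, 10, 3))

    # Grey-scale channel: a luminance-style weighted sum of the colour channels.
    grey <- 0.299 * img[, , 1] + 0.587 * img[, , 2] + 0.114 * img[, , 3]

    # Gradient magnitude channel via central differences (edges replicated).
    shift_left  <- cbind(grey[, -1], grey[, ncol(grey)])
    shift_right <- cbind(grey[, 1], grey[, -ncol(grey)])
    shift_up    <- rbind(grey[-1, ], grey[nrow(grey), ])
    shift_down  <- rbind(grey[1, ], grey[-nrow(grey), ])
    gx <- (shift_left - shift_right) / 2
    gy <- (shift_up - shift_down) / 2
    grad_mag <- sqrt(gx^2 + gy^2)

    # ACF would then pool (aggregate) each channel over small blocks and hand
    # the pooled values to a boosting method.
    dim(grad_mag)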


Starting A Data Science Project: Business Understanding

I continue my data science project series:

As you listen to these types of questions, your goal is to nail down a specific problem with a specific answer.  You want to narrow down the scope to something that your team can achieve, ideally something with a built-in measure for success.  For example, here are a few specific problems that we could go solve:

  • Find a model which predicts quarterly sales to within 5% no later than 30 days into the quarter.
  • Given a title and description for a product, tell me a listing category which Amazon will, with at least 90% confidence, consider valid for this product.
  • Determine the top three factors which most affect the number of years the first owner holds onto our mid-range sedan.

With a specific problem in mind, you can look for relevant data.  Of course, you’ll probably need to modify the scope of this problem over time as you gather new information, but this gives you a starting point for success.  Also, don’t expect something as clear-cut as the above early on; instead, people will hem and haw, not quite sure what they really want.  You can take a fuzzy goal into data acquisition, but as you acquire data, you will want to work with the champion to focus down to a targeted and valuable problem.

Read on for several references to big sacks of cash.  After becoming a manager, I’ve become much more attuned to the idea of receiving big sacks of cash.


Reviewing The Team Data Science Process

I am starting a new series on launching a data science project, and my presentation quickly veers into a pessimistic place:

The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data.  But the truth is that, in a real-world production scenario, we will never have truly clean data.  Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results.  Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of a line or a 10-key operator striking 4 instead of 7.

Even with fully automated processes, we still run the risk of dirty data:  sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons.  Each of these can negatively impact your data, leaving you with invalid entries.

Read on for a few more adages which shape the way we work on projects, followed by an overview of the Microsoft Team Data Science Process.


Methods To Improve Model Accuracy

Tristan Robinson shows how to go back to the drawing board when your model’s accuracy isn’t cutting it:

One of the recurring principles that appears in machine learning is Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but a preference for simplicity. Therefore there is a need to balance accuracy against simplicity by limiting the feature set, which tends to lead to better predictions. Simpler models are also more interpretable to humans, which helps as well. While the data I was working with was limited to around 35 features, there are many data science problems which have thousands of features, so this technique is even more crucial there.

There are multiple methods to perform feature selection, a few of which will be covered here. The first method is greedy backward selection, which starts with all of the features and iteratively finds the feature that hurts predictive power the least when removed, then removes it. This is repeated until a stopping point is reached (which will be discussed later). It’s known as greedy since it never looks back after removing a feature.

An alternative method is greedy forward selection, which is essentially the inverse: it starts with no features and looks for the single feature that by itself makes the best model. It then carries on in a similar vein to backward selection, but adding features rather than removing them. The point at which you stop with forward selection is the point of diminishing returns for your accuracy.

Read the whole thing.  This is explanation rather than demonstration, but the explanation applies to pretty much any implementation you’re using.
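
By way of a small demonstration of the forward variant, here is a hedged R sketch of greedy forward selection using plain linear models on mtcars, with in-sample residual sum of squares as the score; the 1% improvement cutoff standing in for the “diminishing returns” stopping point is an arbitrary choice for this example, not something from the article.

    # Greedy forward selection sketch: start with no features, repeatedly add
    # the feature that improves the fit the most, stop at diminishing returns.
    data(mtcars)
    target     <- "mpg"
    candidates <- setdiff(names(mtcars), target)
    selected   <- character(0)
    best_rss   <- sum((mtcars[[target]] - mean(mtcars[[target]]))^2)

    repeat {
      remaining    <- setdiff(candidates, selected)
      improvements <- sapply(remaining, function(f) {
        fml <- reformulate(c(selected, f), response = target)
        best_rss - sum(residuals(lm(fml, data = mtcars))^2)
      })
      # Stop when nothing is left or the best gain is under 1% of the current RSS.
      if (length(improvements) == 0 || max(improvements) < 0.01 * best_rss) break
      best_feature <- names(which.max(improvements))
      selected     <- c(selected, best_feature)
      best_rss     <- best_rss - max(improvements)
      cat("Added", best_feature, "- RSS now", round(best_rss, 2), "\n")
    }

    selected  # the greedily chosen feature set

Backward selection is the mirror image: start with every feature and repeatedly drop the one whose removal hurts the fit the least.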


JupyterLab Now Available

Project Jupyter announces the general availability of JupyterLab:

JupyterLab is an interactive development environment for working with notebooks, code and data. Most importantly, JupyterLab has full support for Jupyter notebooks. Additionally, JupyterLab enables you to use text editors, terminals, data file viewers, and other custom components side by side with notebooks in a tabbed work area.

JupyterLab provides a high level of integration between notebooks, documents, and activities:

  • Drag-and-drop to reorder notebook cells and copy them between notebooks.

  • Run code blocks interactively from text files (.py, .R, .md, .tex, etc.).

  • Link a code console to a notebook kernel to explore code interactively without cluttering up the notebook with temporary scratch work.

  • Edit popular file formats with live preview, such as Markdown, JSON, CSV, Vega, VegaLite, and more.

I like this.  I’m a big fan of notebooks, but sometimes you just want to write some diagnostic queries, and an IDE is way better for that.  H/T Giovanni Lanzani


The Basics Of PCA In R

Prashant Shekhar gives us an overview of Principal Component Analysis using R:

PCA changes the axes toward the directions of maximum variance and then takes the projection of the data onto these new axes. The direction of maximum variance is represented by the first principal component (PC1). There are multiple principal components, depending on the number of dimensions (features) in the dataset, and they are orthogonal to each other. The maximum number of principal components is the same as the number of dimensions of the data. For example, in the above figure, for two-dimensional data, there will be at most two principal components (PC1 & PC2). The first principal component captures the most variance, followed by the second principal component, the third principal component, and so on. Dimension reduction comes from the fact that it is possible to discard the last few principal components, as they will not capture much of the variance in the data.

PCA is a useful technique for reducing dimensionality and removing covariance.
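
As a quick illustration of those ideas (not code from the linked post), prcomp() in base R produces the principal components, the proportion of variance each one captures, and the projected data you would keep after discarding the trailing components.

    # PCA on the four numeric columns of iris; scaling puts features on a common footing.
    pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

    # Proportion of variance captured by each component: PC1 first, then PC2, and so on.
    summary(pca)$importance["Proportion of Variance", ]

    # Dimension reduction: keep only the first two PCs as the new coordinates.
    reduced <- pca$x[, 1:2]
    head(reduced)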


Investigating The gcForest Algorithm

William Vorhies describes a new algorithm with strong potential:

gcForest (multi-Grained Cascade Forest) is a decision tree ensemble approach in which the cascade structure of deep nets is retained but where the opaque edges and node neurons are replaced by groups of random forests paired with completely-random tree forests.  In this case, typically two of each for a total of four in each cascade layer.

Image and text problems are categorized as ‘feature learning’ or ‘representation learning’ problems where features are neither predefined nor engineered as in traditional ML problems.  And the basic rule in these feature discovery problems is to go deep, using multiple layers, each of which learns relevant features of the data in order to classify them.  Hence the multi-layer structure so familiar from DNNs is retained.

By using both random forests and completely-random tree forests, the authors gain the advantage of diversity.  Each forest contains 500 completely random trees, allowed to split until each leaf node contains only instances of the same class, making the growth self-limiting and adaptive, unlike the fixed and large depth required by DNNs.

The estimated class distribution forms a class vector which is then concatenated with the original feature vector to be the input of the next level cascade.  Not dissimilar from CNNs.

The final model is a cascade of cascade forests.  The final prediction is obtained by aggregating the class vectors and selecting the class with the highest aggregated score.

Click through for how it fares on the normal sample data sets like MNIST.
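
To make the cascade idea concrete, here is a very rough R sketch of a single cascade level using the randomForest package: train a pair of forests, take their class-probability vectors, and concatenate them onto the original features as the input for the next level.  It omits the multi-grained scanning, the completely-random tree forests, and the k-fold generation of class vectors, so treat it as a toy rather than gcForest itself.

    library(randomForest)

    # Toy data: iris features and species labels.
    x <- iris[, 1:4]
    y <- iris$Species

    # One cascade level: a pair of forests (the paper pairs random forests with
    # completely-random tree forests; both are ordinary random forests here).
    rf1 <- randomForest(x, y, ntree = 100)
    rf2 <- randomForest(x, y, ntree = 100, mtry = 1)  # crude stand-in for a "more random" forest

    # Class vectors: per-class probability estimates from each forest.
    cv1 <- predict(rf1, x, type = "prob")
    cv2 <- predict(rf2, x, type = "prob")

    # Final prediction for this one-level toy: aggregate the class vectors and
    # pick the class with the highest aggregated score.
    avg_scores <- (cv1 + cv2) / 2
    predicted  <- colnames(avg_scores)[max.col(avg_scores)]

    # Input to the next cascade level: original features plus both class vectors.
    next_level_input <- cbind(
      x,
      setNames(as.data.frame(cv1), paste0("rf1_", colnames(cv1))),
      setNames(as.data.frame(cv2), paste0("rf2_", colnames(cv2)))
    )
    str(next_level_input)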


DataExplorer

Boxuan Cui introduces DataExplorer, an R package dedicated to assist with exploratory data analysis:

According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task, accounting for roughly 80% of the work. DataExplorer is one of the resources out there whose sole mission is to minimize that 80% and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible enough with input data classes that you should be able to throw in any data.frame-like object. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover in a later part of the post.

For my money, that number is closer to 90%.  I will have to check this package out.
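
For anyone else planning that first look, here is a minimal sketch using a few functions I believe the package exposes (introduce, plot_missing, plot_histogram, create_report), run against a built-in data set rather than anything from the post.

    library(DataExplorer)

    # Quick structural summary: rows, columns, missing values, column types.
    introduce(airquality)

    # Visual EDA, one function call at a time.
    plot_missing(airquality)     # share of missing values per column
    plot_histogram(airquality)   # distributions of the continuous columns

    # Or generate the full HTML profiling report in one go.
    create_report(airquality)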


The Year Of The Data Engineer

Alex Woodie points out that data science also requires data engineers:

The shortage of data scientists – those triple-threat types who possess advanced statistics, business, and coding skills – has been well-documented over the years. But increasingly, businesses are facing a shortage of another key individual on the big data team who’s critical to achieving success – the data engineer.

Data engineers are experts in designing, building, and maintaining the data-based systems in support of an organization’s analytical and transactional operations. While they don’t boast the quantitative skills that a data scientist would use to, say, build a complex machine learning model, data engineers do much of the other work required to support that data science workload, such as:

  • Building data pipelines to collect data and move it into storage;

  • Preparing the data as part of an ETL or ELT process;

  • Stitching the data together with scripting languages;

  • Working with the DBA to construct data stores;

  • Ensuring the data is ready for use;

  • Using frameworks and microservices to serve data.

Read the whole thing.  My experience is that most shops looking to hire a data scientist really need to get data engineers first; otherwise, you’re wasting that high-priced data scientist’s time.  The plus side is that if you’re already a database developer, getting into data engineering is much easier than mastering statistics or neural networks.
