Press "Enter" to skip to content

Category: Data Science

Advice For A Budding Data Scientist

Charles Parker riffs off of an Edsger Dijkstra note:

It’s still early days for machine learning. The bounds and guidelines about what is possible or likely are still unknown in a lot of places, and bigger projects that test more of those limitations are more likely to fail. As a fledgling data engineer, especially in the industry, it’s almost certainly the more prudent course to go for the “low-hanging fruit” — easy-to-find optimizations that have real world impact for your organization. This is the way to build trust among skeptical colleagues and also the way to figure out where those boundaries are, both for the field and for yourself.

As a personal example, I was once on a project where we worked with failure data from large machines with many components. The obvious and difficult problem was to use regression analysis to predict the time to failure for a given part. I had some success with this, but nothing that ever made it to production. However, a simple clustering analysis that grouped machines by the frequency of replacement for all parts had some lasting impact; this enabled the organization to “red flag” machines that fell into “high replacement” group where the users may have been misusing the machines and bring these users in for training.

There’s some good advice.  Also read the linked Dijkstra note; even in bullet point form, he was a brilliant guy.

Comments closed

ML Algorithm Cheat Sheet

Hui Li has a quick cheat sheet on which algorithms might be useful in a particular situation:

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

  • The size, quality, and nature of data.
  • The available computational time.
  • The urgency of the task.
  • What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. We are not advocating a one and done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

Hui then goes into detail on each. h/t Vincent Granville

Comments closed

Data Science Resources

Steph Locke has some resources if you are interested in getting started with data science:

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is written by Hadley Wickham and Garett Grolemund. You can buy it and you can also access it online.

If you’re interested in learning to actually start doing data science as a practitioner, this book is a very accessible introduction to programming.

Starting gently, this book doesn’t teach you much about the use of R from a general programming perspective. It takes a very task oriented approach and teaches you R as you go along.

This book doesn’t cover the breadth and depth of data science in R, but it gives you a strong foundation in the coding skills you need and gives you a sense of the of the process you’ll go through.

It’s a good starting set of links.

Comments closed

Survival Analysis

Joseph Rickert explains what survival analysis is and shows an example with R:

Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger

The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

Survival analysis is an interesting field of study.  In engineering fields, the most common use is calculating mean time to failure, but that’s certainly not the only place you’re liable to see it.

Comments closed

Outliers In Histograms

Edwin Thoen has an interesting solution to a classic problem with histograms:

Two strategies that make the above into something more interpretable are taking the logarithm of the variable, or omitting the outliers. Both do not show the original distribution, however. Another way to go, is to create one bin for all the outlier values. This way we would see the original distribution where the density is the highest, while at the same time getting a feel for the number of outliers. A quick and dirty implementation of this would be

hist_data %>% mutate(x_new = ifelse(x > 10, 10, x)) %>% ggplot(aes(x_new)) + geom_histogram(binwidth = .1, col = "black", fill = "cornflowerblue")

Edwin then shows a nicer solution, so read the whole thing.

Comments closed

The Birthday Problem

Mala Mahadevan explains the Birthday problem and demonstrates it with SQL and R:

Given a room of 23 random people, what are chances that two or more of them have the same birthday? 

This problem is a little different from the earlier ones, where we actually knew what the probability in each situation was.

What are chances that two people do NOT share the same birthday? Let us exclude leap years for now..chances that two people do not share the same birthday is 364/365, since one person’s birthday is already a given. In a group of 23 people, there are 253 possible pairs (23*22)/2. So the chances of no two people sharing a birthday is 364/365 multiplied 253 times. The chances of two people sharing a birthday, then, per basics of probability, is 1 – this.

The funny thing for me is that I’ve had the Birthday problem explained three separate times using as a demo the 20-30 people in the classroom.  In none of those three cases was there a match, so although I understand that it is correct and how it is correct, the 100% failure to replicate led a little nagging voice in the back of my mind to discount it.

Comments closed

Statistics For Programmers

Julia Evans shares some good resources for developers interested in statistics:

There are a lot of good links in Julia’s post.  I should also mention that Andrew Gelman and Deborah Nolan have a new book coming out in July.  Gelman’s Bayesian approach suits me well, so I’m pre-ordering the book.

Comments closed

Building A Recommendation System With Graph Data

Arvind Shyamsundar and Shreya Verma show how to implement a recommender system using SQL Server 2017’s new graph database functionality:

As you can see from the animation, the algorithm is quite simple:

  • First, we identify the user and ‘current’ song to start with (red line)
  • Next, we identify the other users who have also listened to this song (green line)
  • Then we find the other songs which those other users have also listened to (blue, dotted line)
  • Finally, we direct the current user to the top songs from those other songs, prioritized by the number of times they were listened to (this is represented by the thick violet line.)

The algorithm above is quite simple, but as you will see it is quite effective in meeting our requirement. Now, let’s see how to actually implement this in SQL Server 2017.

Click through for animated images as well as an actual execution plan and recommendations for graph query optimization (spoilers:  columnstore all the things).  They also link to the GitHub project where you can try it out yourself.

Comments closed

Instantiating Cortana Intelligence Gallery Solutions

Melissa Coates has a step-by-step guide showing how to install a solution from the Cortana Intelligence Gallery:

We had no options along the way for selecting names for resources, so we have a lot of auto-generated suffixes for our resource names. This is ok for purely learning scenarios, but not my preference if we’re starting a true project with a pre-configured solution. Following an existing naming convention is impossible with solutions (at this point anyway). A wish list item I have is for the solution deployment UI to display the proposed names for each resource and let us alter if desired before the provisioning begins.

The deployment also doesn’t prompt for which subscription to deploy to (if you have multiple subscriptions like I do). The deployment did go to the subscription I wanted, however, it would be really nice to have that as a selection to make sure it’s not just luck.

It sounds like there are some undesirable defaults, but at least it does appear to be very easy to do.

Comments closed