Press "Enter" to skip to content

Category: Data Science

Understanding ROC Curves

Bob Horton explains ROC curves and shows how to create them in R:

ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.

ROC curves are one of the primary techniques for figuring out if a binary classifier “works.”
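To make the threshold-sweeping idea concrete, here’s a minimal base-R sketch (a toy simulation of my own, not Bob’s code) that scores two groups, treats every observed score as a candidate threshold, and plots the resulting true positive rate against the false positive rate:

    set.seed(42)

    # Simulated classifier output: positives tend to score higher than negatives
    labels <- c(rep(1, 200), rep(0, 200))
    scores <- c(rnorm(200, mean = 1), rnorm(200, mean = 0))

    # Sweep every observed score as a candidate threshold
    thresholds <- sort(unique(scores), decreasing = TRUE)
    tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))  # sensitivity
    fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))  # 1 - specificity

    # One (FPR, TPR) point per threshold traces out the ROC curve
    plot(fpr, tpr, type = "l",
         xlab = "False positive rate", ylab = "True positive rate",
         main = "ROC curve (simulated scores)")
    abline(0, 1, lty = 2)  # reference line for a coin-flip classifier

Packages such as pROC or ROCR handle the same bookkeeping (plus AUC) for you.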

SQL Saturday Attendee Distances

I have a long, long post on figuring out how far SQL Saturday attendees tend to drive:

Before I begin, allow me to perform the data science Airing of Grievances.  This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase.  But no, let’s revel in its filth for just a few moments.

Despite my protestations and complaints, I think there are some reasonable conclusions.  If you need to look like you’re working for a couple of hours (or at least want to play around a bit with SQL and R), this is the post for you.

Intro To Data Analytics

Stacia Varga has a follow-up to her Gentle Introduction to Data Analytics:

Q: How is data analytics on SAS different from R support for SQL2016?

A: I’ve not used SAS (although I did go to their campus to teach MDX to their engineers once upon a time… beautiful campus!), so I can’t say with any specificity. SAS also has multiple products so it would be difficult to describe differences. In general – and please understand there may be something new in SAS that I don’t know about – I think the difference is that R in SQL 2016 allows you to run functions on data inside the database, thereby leveraging database resources and also server resources for parallelization. I’d love to get input from readers to expand on this topic.

I can confirm that SAS has a beautiful campus.

Traveling Salesmen

Jesse Clark explains the traveling salesman problem:

One of the canonical questions in operations is the traveling salesman problem (TSP). In its simplest form, we have a busy salesperson who must visit a set number of locations once. Time is money, so the salesperson wants to choose a route that minimizes the total distance traveled. It is not so hard to imagine these path optimization problems occurring within warehouses where people (‘pickers’) need to navigate aisles and fill orders as they go.

The Traveling Salesman Problem is a computer science classic and a canonical graph optimization problem.  Check out this post for more details.
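To make the cost of the problem concrete, here’s a toy base-R sketch (not from Jesse’s post) that brute-forces the shortest round trip over a handful of random points:

    set.seed(7)

    # A few random "locations" on a plane (made-up data for illustration)
    n <- 6
    pts <- data.frame(x = runif(n), y = runif(n))
    d <- as.matrix(dist(pts))  # pairwise distance matrix

    # All orderings of the remaining stops, starting and ending at location 1
    perms <- function(v) {
      if (length(v) <= 1) return(list(v))
      out <- list()
      for (i in seq_along(v)) {
        for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
      }
      out
    }

    tour_length <- function(stops) {
      route <- c(1, stops, 1)  # round trip from location 1
      sum(d[cbind(route[-length(route)], route[-1])])
    }

    candidates <- perms(2:n)
    lengths <- sapply(candidates, tour_length)
    best <- candidates[[which.min(lengths)]]

    cat("Best tour: 1 ->", paste(best, collapse = " -> "), "-> 1\n")
    cat("Total distance:", min(lengths), "\n")

The number of candidate routes grows factorially with the number of stops, which is why real systems lean on heuristics and approximation algorithms rather than exhaustive search.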

Range And Variance

Mala Mahadevan looks at calculating range, variance, and standard deviation in R and T-SQL:

The first and most common measure of dispersion is called the ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data: a single value that is very large or very small can skew the range, even though the rest of the values do not really span the minimum to the maximum.

To reduce this kind of issue with outliers, a second variation of the range, called the Inter-Quartile Range or IQR, is used. The IQR is calculated by sorting the values in ascending order and dividing the dataset into four equal parts; the largest value in the first part (the first quartile) is then subtracted from the largest value in the third part (the third quartile). Because the IQR ignores the extreme quarter of values at each end, it gives a measure of spread that better reflects where the bulk of the data lies.

Just like her previous post, this one also includes an example built for SQL Server R Services.
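For quick reference, these measures are one-liners in base R; this is a generic sketch on made-up numbers, not Mala’s SQL Server R Services example:

    # Toy data with one outlier to show how it stretches the range
    x <- c(4, 7, 9, 10, 11, 12, 13, 15, 16, 120)

    diff(range(x))              # range: max minus min, dragged out by the outlier
    IQR(x)                      # inter-quartile range: Q3 - Q1, ignores the extremes
    quantile(x, c(0.25, 0.75))  # the quartiles themselves
    var(x)                      # variance
    sd(x)                       # standard deviation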

Storytelling With Data

Vik Paruchuri walks through exploratory data analysis using New York City schools data:

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

  • Group full by school district.

  • Compute the average of each column for each school district.

  • Convert the school_dist field to remove leading 0s, so we can match our geographic district data.
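Since the post’s own code isn’t reproduced here, a minimal dplyr sketch of those three steps might look like the following; full and school_dist are the names used in the post, but the exact column layout is an assumption:

    library(dplyr)

    # Assumes full is the combined schools data frame from the post,
    # with a character school_dist column such as "01", "02", ...
    districts <- full %>%
      group_by(school_dist) %>%
      summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%  # average each numeric column
      mutate(school_dist = as.numeric(school_dist))  # drop leading zeros to match the geographic data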

Also check out part 1 if you missed it.

Clustering With Spark

Konur Unyelioglu shows how to implement k-means and Gaussian clustering techniques in Apache Spark using MLlib:

Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. more similar to each other, than to entities in other clusters. As opposed to classification problems, where the goal is to learn from labeled examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning.

There are many different clustering algorithms, and a central notion in all of them is the definition of ‘similarity’ between the entities being grouped. Different clustering algorithms may have different ways of measuring that similarity. Another common notion in many clustering algorithms is the so-called cluster center, which serves as a representative of the cluster. For example, in the k-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.

This is a fairly lengthy article but if you want to get into machine learning with Spark, it’s a good one.
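The article works through MLlib’s API, but the cluster-center idea is easy to see with base R’s kmeans function on a toy dataset (a generic sketch, not code from the article):

    set.seed(1)

    # Two well-separated blobs of points in the plane
    pts <- rbind(
      cbind(rnorm(50, mean = 0), rnorm(50, mean = 0)),
      cbind(rnorm(50, mean = 5), rnorm(50, mean = 5))
    )

    fit <- kmeans(pts, centers = 2)

    fit$centers         # each center is the mean position of its cluster's points
    table(fit$cluster)  # cluster sizes

    # Verify the "center = arithmetic mean of members" claim for cluster 1
    colMeans(pts[fit$cluster == 1, ])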

Data Cleansing Outside Of Excel

Lee Baker shows some free alternatives to Excel for data cleansing:

Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data without being aware of it.

The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data back to your original sheet (or better still, to a new sheet that stores only clean data). But the repeated use of Copy and Paste across multiple worksheets to clean your data can become extremely messy.

Lee recommends three free tools, and they look like they’re worth trying out.
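By way of contrast, a script makes the scope of a cleaning operation explicit. Here’s a toy base-R sketch (hypothetical column names) that cleans exactly one column and leaves everything else alone:

    # Toy data frame standing in for a worksheet (hypothetical columns)
    df <- data.frame(
      id     = c(1, 2, 3),
      amount = c("1,200", " 340", "5,000 "),
      note   = c("a,b", "c,d", "e"),
      stringsAsFactors = FALSE
    )

    # Strip commas and whitespace in the amount column only --
    # the commas in note are untouched, unlike a worksheet-wide Replace
    df$amount <- as.numeric(gsub(",", "", trimws(df$amount)))
    df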

StackLite Dataset

David Robinson reports on a new Stack Exchange data set available to the public:

For each Stack Overflow question asked since the beginning of the site, the dataset includes:

  • Question ID
  • Creation date
  • Closed date, if applicable
  • Deletion date, if applicable
  • Score
  • Owner user ID (except for deleted questions)
  • Number of answers
  • Tags

This is ideal for performing analyses such as:

  • The increase or decrease in questions in each tag over time

  • Correlations among tags on questions

  • Which tags tend to get higher or lower scores

  • Which tags tend to be asked on weekends vs weekdays

  • Rates of question closure or deletion over time

  • The speed at which questions are closed or deleted

This is pretty exciting.  Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one our way could be a major time-saver.
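As a hypothetical starting point, tallying questions per tag per year might look something like this in R; the file and column names below are assumptions based on the field list above, so check the repository before running it:

    library(readr)
    library(dplyr)
    library(lubridate)

    # Assumed layout: a questions table and a question-tags table joined on the question ID
    questions     <- read_csv("questions.csv.gz")
    question_tags <- read_csv("question_tags.csv.gz")

    tag_growth <- question_tags %>%
      inner_join(questions, by = "Id") %>%
      mutate(year = year(CreationDate)) %>%
      count(Tag, year) %>%   # questions per tag per year
      arrange(Tag, year)

    # e.g. how questions tagged "r" have grown over time
    tag_growth %>% filter(Tag == "r")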
