Before I begin, allow me to perform the data science Airing of Grievances. This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase. But no, let’s revel in its filth for just a few moments.
Despite my protestations and complaints, I think there are some reasonable conclusions. If you need to look like you’re working for a couple of hours (or at least want to play around a bit with SQL and R), this is the post for you.
Q: How is data analytics on SAS different from R support for SQL2016?
A: I’ve not used SAS (although I did go to their campus to teach MDX to their engineers once upon a time… beautiful campus!), so I can’t say with any specificity. SAS also has multiple products so it would be difficult to describe differences. In general – and please understand there may be something new in SAS that I don’t know about – I think the difference is that R in SQL 2016 allows you to run functions on data inside the database, thereby leveraging database resources and also server resources for parallelization. I’d love to get input from readers to expand on this topic.
I can confirm that SAS has a beautiful campus.
One of the canonical questions in operations is the traveling salesman problem (TSP). In its simplest form, we have a busy salesperson who must visit a set number of locations once. Time is money, so the salesperson wants to choose a route that minimizes the total distance traveled. It is not so hard to imagine these path optimization problems occurring within warehouses where people (‘pickers’) need to navigate aisles and fill orders as they go.
The Traveling Salesman Problem is a computer science classic and acts as a classic graph optimization problem. Check this post out for more details.
The first and most common measure of dispersion is called ‘Range‘. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is however, quite misleading when you have outliers in the data. If you have one value that is very large or very small that can skew the Range and does not really mean you have values spanning the minimum to the maximum.
To lower this kind of an issue with outliers – a second variation of the range called Inter-Quartile Range, or IQR is used. The IQR is calculated by dividing the dataset into 4 equal parts after sorting the said value in ascending order. For the first and third part, the maximum values are taken and then subtracted from each other. The IQR ensures that you are looking at top and near-bottom ranges and therefore the value it gives is probably spanning the range.
Just like her previous post, this one also includes an example built for SQL Server R Services.
Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.
We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:
Group full by school district.
Compute the average of each column for each school district.
Convert the school_dist field to remove leading 0s, so we can match our geograpghic district data.
Also check out part 1 if you missed it.
Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. similar to each other than entities in other clusters. As opposed to classification problems where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning task.
There are many different clustering algorithms and a central notion in all of those is the definition of ’similarity’ between the entities that are being grouped. Different clustering algorithms may have different ways of measuring the similarity. In many clustering algorithms, another common notion is the so-called cluster center, which is a basis to represent the cluster. For example, in K-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.
This is a fairly lengthy article but if you want to get into machine learning with Spark, it’s a good one.
Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data without being aware of it.
The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data to your original sheet (or better still, to a new sheet that stores only clean data). The repeated use of Copy, Paste and using multiple worksheets to clean your data can become extremely messy.
Lee recommends three free tools, and they look like they’re worth trying out.
For each Stack Overflow question asked since the beginning of the site, the dataset includes:
- Question ID
- Creation date
- Closed date, if applicable
- Deletion date, if applicable
- Owner user ID (except for deleted questions)
- Number of answers
This is ideal for performing analyses such as:
The increase or decrease in questions in each tag over time
Correlations among tags on questions
Which tags tend to get higher or lower scores
Which tags tend to be asked on weekends vs weekdays
Rates of question closure or deletion over time
The speed at which questions are closed or deleted
This is pretty exciting. Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one out our way could be a major time-saver.
I’ve worked professionally with databases for a living for around 20 years, to varying degrees.
I’ve worked with words professionally for roughly the same amount of time (as an author, editor, and publisher).
It’s only natural that I’d be interested in ways of mashing them together.
Check it out.
Weighting is used to make sure samples reproduce the underlying characteristics of the population they are drawn from. For instance, in the UK 51.6% of voters are women, of these 22.5% are above 65 years of age, 23.1% have a higher education and so on. The people selected to make part of a sample may be recruited to ensure they match these proportions.
Within the context of weighting there are two major sampling procedures: Quota and stratified sampling. The relevant difference between them is the latter uses some sort of randomization device while with the former the interviewer decides whether or not to interview a person with certain characteristics. YouGov presumably applied Quota Sampling for the Brexit survey, and the gender, age and education weights they applied are shown in the following image.
Read the whole thing.