Press "Enter" to skip to content

Category: Data Science

Projections And Confidence Intervals

Mirio De Rosa explains confidence intervals and sampling issues, using polls as an example:

Weighting is used to make sure samples reproduce the underlying characteristics of the population they are drawn from. For instance, in the UK 51.6% of voters are women; of these, 22.5% are above 65 years of age, 23.1% have a higher education, and so on. The people selected to be part of a sample may be recruited to ensure they match these proportions.

Within the context of weighting there are two major sampling procedures: quota and stratified sampling. The relevant difference between them is that the latter uses some sort of randomization device, while with the former the interviewer decides whether or not to interview a person with certain characteristics. YouGov presumably applied quota sampling for the Brexit survey[4], and the gender, age, and education weights they applied are shown in the following image.
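
To make the weighting idea concrete, here is a minimal Python sketch of post-stratification: each respondent's weight is the population share of their cell divided by its sample share. The 51.6% figure comes from the quote above; the sample composition is invented for illustration.

import pandas as pd

# Population shares (gender only, for brevity); the 51.6% figure is from the
# quote above, and the sample below is invented and over-represents women.
population_share = {"female": 0.516, "male": 0.484}
sample = pd.Series(["female"] * 60 + ["male"] * 40)

sample_share = sample.value_counts(normalize=True)
weights = sample.map(lambda g: population_share[g] / sample_share[g])

# Women get a weight below 1 (down-weighted), men above 1 (up-weighted).
print(weights.groupby(sample).first())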

Read the whole thing.


Hack Those P Values!

Ned Bicare provides us a sure-fire method for getting our academic papers published:

“If you torture the data long enough, it will confess.”

This aphorism, attributed to Ronald Coase, has sometimes been used in a disrespectful manner, as if it were wrong to do creative data analysis.
In fact, the art of creative data analysis has experienced despicable attacks over the last few years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag psychological science through the mire.
Ned has a great tool to play around with as well, letting us Statistics our way to academic success.
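
The satire works because the underlying mechanics are real: test often enough and something will cross p < 0.05. Here is a minimal Python simulation of one such trick, optional stopping, with all numbers invented for illustration; both groups are drawn from the same distribution, so every "significant" result is a false positive.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def p_hacked_experiment(max_n=100, step=10, alpha=0.05):
    """Add `step` observations per group, test, and stop at the first p < alpha."""
    a, b = [], []
    for _ in range(max_n // step):
        a.extend(rng.normal(0, 1, step))
        b.extend(rng.normal(0, 1, step))
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True  # "significant" -- write it up!
    return False

hits = sum(p_hacked_experiment() for _ in range(1000))
print(f"False-positive rate with optional stopping: {hits / 1000:.1%}")  # well above 5%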

Analyze Fantasy Sports With Spark

Jordan Volz is back with part two of his series on fantasy sports analysis using Apache Spark:

We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because there have been shifts in the ages at which players joined the league over the timespan we are considering. It used to be rare for players to skip college; then it wasn’t; now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.

We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
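
This is not Jordan's actual code, just a minimal PySpark sketch of the year-over-year join he describes, under an assumed record layout of (player, year, zTot, nTot, age, experience) with invented sample rows.

from pyspark import SparkContext

sc = SparkContext(appName="zscore-deltas")

# Assumed layout: (player, year, zTot, nTot, age, experience).
stats = sc.parallelize([
    ("playerA", 1981, 1.2, 0.8, 22, 1),
    ("playerA", 1982, 1.5, 0.9, 23, 2),
    ("playerB", 1981, -0.3, 0.1, 25, 3),
    ("playerB", 1982, 0.2, 0.4, 26, 4),
])

# Key season n by (player, n) and season n-1 by (player, n), then join,
# so each record pairs a season with the player's previous season.
by_season = stats.map(lambda r: ((r[0], r[1]), r))
prev_season = stats.map(lambda r: ((r[0], r[1] + 1), r))

deltas = by_season.join(prev_season).mapValues(
    lambda pair: (pair[0][2] - pair[1][2],  # change in zTot
                  pair[0][3] - pair[1][3],  # change in nTot
                  pair[0][4], pair[0][5])   # age, experience
)

# Two key-value data sets: one keyed by age, one keyed by experience.
by_age = deltas.map(lambda kv: (kv[1][2], (kv[1][0], kv[1][1])))
by_experience = deltas.map(lambda kv: (kv[1][3], (kv[1][0], kv[1][1])))
print(by_age.collect())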

Jordan also looks at player performance over time and makes data analysis look pretty easy.


Data Science At Stack Overflow

David Robinson discusses his role as a data scientist at Stack Overflow:

The most prominent example of where machine learning is used in our product is Providence, our system for matching users to jobs they’ll be interested in. (For example, if you visit mostly Python and JavaScript questions on Stack Overflow, you’ll end up getting Python web development jobs as advertisements.) I work with engineers on the Data team (Kevin Montrose, Jason Punyon, and Nick Larsen) to design, improve, and implement these machine learning algorithms. (Here’s some more about the architecture of the system, built before I joined.) For example, we’ve worked to get the balance right between jobs that are close to a user geographically and jobs that are well-matched in terms of technology, and to ensure that users get a variety of jobs rather than seeing the same ones over and over.

A lot of this process involves designing and analyzing A/B tests, particularly about changing our targeting algorithms, ad design, and other factors to improve clickthrough rate (CTR). This process is more statistically interesting than I’d expected, in some cases letting me find new uses for methods I’d used to analyze biological experiments, and in other cases encouraging me to learn new statistical tools. In fact, much of my series on applying Bayesian methods to baseball batting statistics is actually a thinly-veiled version of methods I’ve used to analyze CTR across ad campaigns.
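
The baseball analogy carries over directly: clicks play the role of hits and impressions the role of at-bats, and a Beta prior shrinks noisy per-campaign CTRs toward the overall average. Here is a minimal Python sketch of that beta-binomial idea, with the prior and the campaign data invented for illustration.

# Invented (clicks, impressions) counts for a few ad campaigns.
campaigns = {"ad_1": (5, 100), "ad_2": (60, 2000), "ad_3": (1, 10)}

# A hand-picked Beta(3, 100) prior; in practice it would be fit to the
# observed CTR distribution across all campaigns.
alpha0, beta0 = 3.0, 100.0

for name, (clicks, imps) in campaigns.items():
    raw = clicks / imps
    # Posterior mean: shrinks the raw CTR toward the prior, more for small samples.
    shrunk = (alpha0 + clicks) / (alpha0 + beta0 + imps)
    print(f"{name}: raw CTR {raw:.3f} -> shrunken estimate {shrunk:.3f}")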

Sounds like a fun place to be.


Data Science Languages

David Crook walks through his data science workflow and discusses language choice:

So I’ve spent a while now looking at three competing languages, and I did my best to give each one a fair shake. Those three languages were F#, Python, and R. I have to say it was really close for a while, because each language has its strengths and weaknesses. That said, I am moving forward with two languages and a very specific way I use each one. I wanted to outline this because, for me, it has taken a very long time to learn all of these languages to the level that I have in order to discover this, and I would hate for others to go through the same exercise.

Read on for his decision, as well as how you go from “here’s some raw data” to “here are some services to expose interesting results.”


Box And Whisker Plots

Slava Murygin shows how to create a box and whisker plot in SSMS using spatial data types:

If you have no idea what a Box-and-Whisker Plot is, please visit the following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots

First, I will show how to do it based on the AdventureWorks database in SQL Server 2014.

We will analyze the amounts of individual Sales Order lines within each month.

The first step is to create a data set to process. That data set will contain the month, the single-line amount, and the order number of that record within the month.

This is really cool…but I wonder if it wouldn’t be better to do this in R, where it’d take a lot less code.  If you can’t reach out to R, though, this is a good way of visualizing results.
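
For a sense of the difference in effort, here is what the same chart looks like in Python with pandas and matplotlib (R's boxplot() is similarly brief); the data below is invented to stand in for the AdventureWorks line amounts.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented stand-in for the Sales Order line amounts, 200 lines per month.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": np.repeat(range(1, 13), 200),
    "line_amount": rng.lognormal(mean=5, sigma=1, size=12 * 200),
})

# One box per month: box = quartiles, whiskers = 1.5 * IQR, points = outliers.
df.boxplot(column="line_amount", by="month")
plt.suptitle("")
plt.title("Order line amounts by month")
plt.show()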


Fallacies of Data Science

Adnan Masood and David Lazar have a list of fallacies in the world of data science:

Extrapolating beyond the range of training data, especially in the case of time series data, is fine, provided the data set is large enough.

Strong evidence is the same as proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
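
Both statements are, of course, the fallacies the title promises. A minimal Python sketch (all numbers invented) shows why the extrapolation one fails: no amount of extra data makes a linear fit valid outside the range where the relationship was observed.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)              # training range: [0, 10]
y = 0.5 * x**2 + rng.normal(0, 1, 200)   # true relationship is quadratic

slope, intercept = np.polyfit(x, y, deg=1)  # the linear fit looks fine in-range

for x_new in (5, 20, 50):  # in-range, then well beyond it
    pred, truth = slope * x_new + intercept, 0.5 * x_new**2
    print(f"x={x_new}: predicted {pred:.0f}, truth ~{truth:.0f}")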

These are some good things to think about if you’re getting into analytics.


POCs As A Problem

Bill Vorhies argues that data science proofs of concept fall short of the mark:

If you do a quick read-through of some of the Gartner or O’Reilly studies, you’ll quickly see that a lack of executive sponsorship is one of the major barriers to adoption. So isn’t the POC a good way to get the attention of the C-level? Yes and no.

If, as we described above, it leads to the adoption of a series of standalone ‘technology projects’, then no. If it was really necessary to start with little firecracker POCs to demonstrate the explosive strategic value of becoming data-driven, then maybe so.

Here’s a simple change of mindset (borrowed from John Weathington, referenced above): instead of focusing on Proof of Concept, we should create projects to demonstrate Proof of Value. By focusing on value, we change the orientation so that any projects are aligned with value to the company. In other words, they are aligned with the company’s strategic objectives.

This is an interesting argument which goes against my inclinations.  Check it out.


Cortana Intelligence Suite

Buck Woody discusses various components of the Cortana Intelligence Suite:

It’s not a simple matter of “choose one from column B and two from column A” – you have to learn the processes, and then the tools, and then think about your situation. In other words, some things are complicated because they are…complicated. However:

There are some things you can consider out of the box. So I spoke with my friend Romit Girdhar while we were co-teaching in London last week, and he put together a great visualization. You can see it here, and download the PDF below. Thanks, Romit!

And of course they had to change the name—it wouldn’t be a Microsoft product if the name didn’t change every six months…


Exploring Spark

Adnan Masood has photos of slides from a Spark-related meetup:

Apache Spark is a general-purpose cluster computing platform which extends MapReduce to support multiple computation types, including but not limited to stream processing and interactive queries. Last week, IBM’s Moktar Kandil presented at the joint Tampa Hadoop and Tampa Data Science Group meetup on the topic of exploring Apache Spark.

Apache Spark for Azure HDInsight

Following are some of the slides discussed in the meetup. To play with the ALS recommendation engine notebook, please register at www.datascientistworkbench.com, a free notebook platform for Apache Spark intended for educational purposes.
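
The ALS notebook refers to Spark's built-in alternating least squares recommender. Here is a minimal PySpark sketch of that API, with a handful of invented (user, item, rating) triples standing in for real data.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()

# Invented ratings standing in for real user-item interaction data.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 0, 5.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=10, regParam=0.1)
model = als.fit(ratings)

# Top two recommendations per user.
model.recommendForAllUsers(2).show(truncate=False)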

Check out the links.
