Ned Bicare provides us a sure-fire method for getting our academic papers published:
“If you torture the data long enough, it will confess.”
This aphorism, attributed to Ronald Coase, has sometimes been used in a disrespectful manner, as if it were wrong to do creative data analysis. In fact, the art of creative data analysis has experienced despicable attacks over the last few years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag psychological science through the mire.
We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because there have been shifts in what ages players joined the league over the timespan we are considering. It used to be rare for players to skip college, then it wasn’t, now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.
We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
Jordan also looks at player performance over time and makes data analysis look pretty easy.
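To make the year-over-year step concrete, here is a minimal sketch in plain Python rather than Spark. The record layout, player names, and numbers are all hypothetical; the original post works on an RDD whose exact schema is not shown here. It computes the change in z-score and normalized z-score between consecutive seasons, then keys those changes once by age and once by experience.

```python
from collections import defaultdict

# Hypothetical records: (player, year, age, experience, z_tot, n_tot).
# None of these values come from the original post.
records = [
    ("player_a", 2001, 22, 1, 1.0, 0.10),
    ("player_a", 2002, 23, 2, 1.5, 0.16),
    ("player_a", 2003, 24, 3, 1.25, 0.12),
    ("player_b", 2002, 19, 1, -0.5, -0.04),
    ("player_b", 2003, 20, 2, 0.25, 0.02),
]


def yearly_changes(rows):
    """Yield (age, experience, d_z, d_n) for each pair of consecutive seasons."""
    by_player = defaultdict(list)
    for row in rows:
        by_player[row[0]].append(row)
    for seasons in by_player.values():
        seasons.sort(key=lambda r: r[1])  # order each player's seasons by year
        for prev, cur in zip(seasons, seasons[1:]):
            if cur[1] - prev[1] != 1:
                continue  # skip gaps so a change always spans exactly one year
            yield cur[2], cur[3], cur[4] - prev[4], cur[5] - prev[5]


def group_by(changes, key_index):
    """Group (d_z, d_n) pairs by age (key_index=0) or experience (key_index=1)."""
    grouped = defaultdict(list)
    for change in changes:
        grouped[change[key_index]].append((change[2], change[3]))
    return dict(grouped)


changes = list(yearly_changes(records))
by_age = group_by(changes, 0)         # age -> list of (d_z, d_n)
by_experience = group_by(changes, 1)  # experience -> list of (d_z, d_n)
```

In Spark the same shape would probably come from grouping by player and emitting two keyed RDDs, but that translation is left as an assumption.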
A lot of this process involves designing and analyzing A/B tests, particularly about changing our targeting algorithms, ad design, and other factors to improve clickthrough rate (CTR). This process is more statistically interesting than I’d expected, in some cases letting me find new uses for methods I’d used to analyze biological experiments, and in other cases encouraging me to learn new statistical tools. In fact, much of my series on applying Bayesian methods to baseball batting statistics is actually a thinly-veiled version of methods I’ve used to analyze CTR across ad campaigns.
Sounds like a fun place to be.
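For a rough idea of what a Bayesian take on a CTR A/B test can look like, here is a small beta-binomial sketch. It is not the method from the linked post: the uniform Beta(1, 1) prior, the click counts, and the function names are all assumptions made for illustration.

```python
import random


def posterior(clicks, impressions, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for a CTR, given a Beta(prior_a, prior_b) prior."""
    return prior_a + clicks, prior_b + impressions - clicks


def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(CTR_B > CTR_A) under the two posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


# Hypothetical campaign counts: (clicks, impressions).
post_a = posterior(100, 10_000)
post_b = posterior(150, 10_000)
p_b_better = prob_b_beats_a(post_a, post_b)
```

The appeal over a plain significance test is that `p_b_better` answers the question people actually ask ("how likely is B to be better?"), at the cost of having to pick a prior.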
So I’ve spent a while now looking at 3 competing languages and I did my best to give each one a fair shake. Those 3 languages were F#, Python and R. I have to say it was really close for a while because each language has its strengths and weaknesses. That said, I am moving forward with 2 languages and a very specific way I use each one. I wanted to outline this because it has taken me a very long time to learn all of these languages well enough to discover this, and I would hate for others to go through the same exercise.
Read on for his decision, as well as how you go from “here’s some raw data” to “here are some services to expose interesting results.”
If you have no idea what a Box-and-Whisker Plot is, please visit the following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
First, I will show how to do it using the AdventureWorks database in SQL Server 2014.
We will analyze the amounts of individual Sales Order lines within each month.
The first step is to create a Data Set to process. That Data Set will contain the month, the single-line amount, and the ordinal position of that record within its month.
This is really cool…but I wonder if it wouldn’t be better to do this in R, where it’d take a lot less code. If you can’t reach out to R, though, this is a good way of visualizing results.
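For comparison, here is a short Python sketch of the numbers a box plot needs from that kind of data set. The sample rows are made up (they are not AdventureWorks data), and the quartile convention (`method="inclusive"`) is an assumption; the linked post may compute its quartiles and whiskers differently.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (month, line amount) rows standing in for sales order lines.
lines = [
    ("2014-01", 30), ("2014-01", 10), ("2014-01", 50),
    ("2014-01", 20), ("2014-01", 40),
    ("2014-02", 25), ("2014-02", 5), ("2014-02", 35), ("2014-02", 15),
]


def five_number_summary(rows):
    """Return {month: (min, q1, median, q3, max)} for the line amounts."""
    by_month = defaultdict(list)
    for month, amount in rows:
        by_month[month].append(amount)
    summary = {}
    for month, amounts in by_month.items():
        amounts.sort()  # sorting gives each record its position within the month
        q1, median, q3 = quantiles(amounts, n=4, method="inclusive")
        summary[month] = (amounts[0], q1, median, q3, amounts[-1])
    return summary


summary = five_number_summary(lines)
```

Each tuple maps directly onto one box: the ends of the whiskers, the edges of the box, and the line in the middle.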
Extrapolating beyond the range of training data, especially in the case of time series data, is fine provided the dataset is large enough.
Strong evidence is the same as proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
These are some good things to think about if you’re getting into analytics.
If you do a quick read through of some of the Gartner or O’Reilly studies you’ll quickly see that a lack of executive sponsorship is one of the major barriers to adoption. So isn’t the POC a good way to get the attention of the C-level? Yes and no.
If, as we described above, it leads to the adoption of a series of stand-alone ‘technology projects’, then no. If it was really necessary to start with little firecracker POCs to demonstrate the explosive strategic value of becoming data-driven, then maybe so.
Here’s a simple change of mindset (borrowed from John Weathington, referenced above): instead of focusing on Proof of Concept, we should create projects to demonstrate Proof of Value. By focusing on value we change the orientation so that any projects are aligned with value to the company. In other words, they are aligned with the company’s strategic objectives.
This is an interesting argument which goes against my inclinations. Check it out.
It’s not a simple matter of “choose one from column B and two from column A” – you have to learn the processes, and then the tools, and then think about your situation. In other words, some things are complicated because they are…complicated. However:
There are some things you can consider out of the box. So I spoke with my friend Romit Girdhar while we were co-teaching in London last week, and he put together a great visualization. You can see it here, and download the PDF below. Thanks, Romit!
And of course they had to change the name—it wouldn’t be a Microsoft product if the name didn’t change every six months…
Apache Spark is a general-purpose cluster computing platform which extends MapReduce to support multiple computation types including but not limited to stream processing and interactive queries. Last week IBM’s Moktar Kandil presented at the Tampa Hadoop and Tampa Data Science Group Joint meetup on the topic of exploring Apache Spark.
Following are some of the slides discussed in the meetup. To play with the ALS Recommendation engine notebook, please register at www.datascientistworkbench.com, a free Apache Spark notebook platform for educational purposes.
Check out the links.
Security is an obvious consideration which needs to be addressed up front. Data is a very valuable commodity and only people with appropriate access should be allowed to see it. What steps are going to be employed to ensure that happens? How much administration is going to be required to implement it? These questions need to be answered up front.
I want to extend special thanks to Ginger for putting security as the top item on the list. Also, this seems like a pretty good set of criteria for most projects, so definitely check it out.