Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data without being aware of it.
The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data to your original sheet (or better still, to a new sheet that stores only clean data). The repeated use of Copy, Paste and using multiple worksheets to clean your data can become extremely messy.
Lee recommends three free tools, and they look like they’re worth trying out.
For each Stack Overflow question asked since the beginning of the site, the dataset includes:
- Question ID
- Creation date
- Closed date, if applicable
- Deletion date, if applicable
- Owner user ID (except for deleted questions)
- Number of answers
This is ideal for performing analyses such as:
The increase or decrease in questions in each tag over time
Correlations among tags on questions
Which tags tend to get higher or lower scores
Which tags tend to be asked on weekends vs weekdays
Rates of question closure or deletion over time
The speed at which questions are closed or deleted
This is pretty exciting. Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one out our way could be a major time-saver.
I’ve worked professionally with databases for a living for around 20 years, to varying degrees.
I’ve worked with words professionally for roughly the same amount of time (as an author, editor, and publisher).
It’s only natural that I’d be interested in ways of mashing them together.
Check it out.
Weighting is used to make sure samples reproduce the underlying characteristics of the population they are drawn from. For instance, in the UK 51.6% of voters are women, of these 22.5% are above 65 years of age, 23.1% have a higher education and so on. The people selected to make part of a sample may be recruited to ensure they match these proportions.
Within the context of weighting there are two major sampling procedures: Quota and stratified sampling. The relevant difference between them is the latter uses some sort of randomization device while with the former the interviewer decides whether or not to interview a person with certain characteristics. YouGov presumably applied Quota Sampling for the Brexit survey, and the gender, age and education weights they applied are shown in the following image.
Read the whole thing.
Ned Bicare provides us a sure-fire method for getting our academic papers published:
“If you torture the data long enough, it will confess.”
This aphorism, attributed to Ronald Coase, sometimes has been used in a disrespective manner, as if it was wrong to do creative data analysis.In fact, the art of creative data analysis has experienced despicable attacks over the last years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag psychological science through the mire.
We’ll look at both zTot and nTot, and consider the player’s age and experience.The latter is potentially important because there have been shifts in what ages players joined the league over the timespan we are considering. It used to be rare for players to skip college, then it wasn’t, now they are required to play at least one year. It will be interesting to see if we see a difference in age versus experience in the numbers.
We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data, one a key-value pair of age-values, and one a key-value pair of experience-values. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)
Jordan also looks at player performance over time and makes data analysis look pretty easy.
A lot of this process involves designing and analyzing A/B tests, particularly about changing our targeting algorithms, ad design, and other factors to improve clickthrough rate (CTR). This process is more statistically interesting than I’d expected, in some cases letting me find new uses for methods I’d used to analyze biological experiments, and in other cases encouraging me to learn new statistical tools. In fact, much of my series on applying Bayesian methods to baseball batting statistics is actually a thinly-veiled version of methods I’ve used to analyze CTR across ad campaigns.
Sounds like a fun place to be.
So I’ve spent a while now looking at 3 competing languages and I did my best to give each one a fair shake. Those 3 languages were F#, Python and R. I have to say it was really close for a while because each language has its strengths and weaknesses. That said, I am moving forward with 2 languages and a very specific way I use each one. I wanted to outline this, because for me it has taken a very long time to learn all of the languages to the level that I have to discover this and I would hate for others to go through the same exercise.
Read on for his decision, as well as how you go from “here’s some raw data” to “here are some services to expose interesting results.”
If you have no idea what Box-and-Whisker Plot is, please visit following link: http://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
At first, I will show how to do it based on AdventureWorks database in SQL Server 2014.
We will analyze amounts of Individual lines of Sales Orders within each month.
The first step is to create a Data Set to process. That Data Set will contain a Month, Single Line amount and order number of that record within a month.
This is really cool…but I wonder if it wouldn’t be better to do this in R, where it’d take a lot less code. If you can’t reach out to R, though, this is a good way of visualizing results.
Extrapolating beyond the range of training data, especially in the case of time series data, is fine providing the data-set is large enough.
Strong Evidence is same as a Proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
These are some good things to think about if you’re getting into analytics.