The general approach behind each of the examples that we’ll cover below is to:
Fit a regression model to predict variable (Y).
Obtain the predicted and residual values associated with each observation on (Y).
Plot the actual and predicted values of (Y) so that they are distinguishable, but connected.
Use the residuals to make an aesthetic adjustment (e.g. red colour when the residual is very high) to highlight points which are poorly predicted by the model.
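The steps above can be sketched in a few lines of R with ggplot2; this is a minimal illustration using the built-in mtcars data as a stand-in for the post's dataset, not the post's own code:

```r
library(ggplot2)

# Step 1: fit a regression model to predict mpg
fit <- lm(mpg ~ hp, data = mtcars)

# Step 2: obtain predicted and residual values for each observation
d <- mtcars
d$predicted <- predict(fit)
d$residuals <- residuals(fit)

# Steps 3-4: plot actual and predicted values, connected by segments,
# with colour intensity highlighting poorly predicted points
ggplot(d, aes(x = hp, y = mpg)) +
  geom_segment(aes(xend = hp, yend = predicted), alpha = 0.3) +
  geom_point(aes(colour = abs(residuals))) +
  scale_colour_gradient(low = "black", high = "red") +
  geom_point(aes(y = predicted), shape = 1) +
  theme_bw()
```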
The post is about 10% understanding what residuals are and 90% showing how to visualize them and spot major discrepancies.
RStudio has several ways to import data. One of the easiest ways is to import via URL. This link (https://data.montgomerycountymd.gov/api/views/6rqk-pdub/rows.csv?accessType=DOWNLOAD) gives us the salaries of all of the government employees for Montgomery County, MD in a CSV format. To import this into RStudio, copy the URL and go to Tools -> Import Dataset -> From Web URL…
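If you prefer code to the menu route, the same import can be done directly; this is a sketch, and the resulting column names are whatever the county publishes:

```r
url <- "https://data.montgomerycountymd.gov/api/views/6rqk-pdub/rows.csv?accessType=DOWNLOAD"
salaries <- read.csv(url, stringsAsFactors = FALSE)
head(salaries)
```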
R and Python are both good languages to learn for data analysis. I lean just a little bit toward R, but they’re both strong choices in this space.
To collect the data on all the first generation Pokemon, I employ Hadley Wickham’s rvest package. I find it very intuitive, and it can handle all of my needs in collecting and extracting the data from a Pokemon wiki. I will grab all the Pokemon up to Gen II, which constitutes 251 individuals. I did find the website structure a bit of a pain, as each Pokemon had very different-looking web pages. But with some manual hacking, I eventually got the data into a nice format.
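A minimal sketch of that kind of rvest workflow; the wiki URL and selector below are placeholders, not the ones from the post, which had to handle much messier per-Pokemon page structures:

```r
library(rvest)

# Hypothetical wiki page listing the Pokemon (placeholder URL)
page <- read_html("https://example-pokemon-wiki.org/list_of_pokemon")

# Pull the first HTML table on the page into a data frame
pokemon <- page %>%
  html_node("table") %>%
  html_table()

head(pokemon)
```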
This probably means a lot more to you if you grew up in front of a Game Boy, but there’s some good technique in here regardless.
I’ve reproduced Sharon’s code and charts below. I did make a couple of tweaks to the code, though. I added a call to checkpoint("2016-08-22") which, if you’ve saved the code to a file, will install all the necessary packages for you. (I also verified that the code runs with package versions as of today’s date, and if you’re trying out this code at a later time it will continue to do so, thanks to checkpoint.) I also modified the data download code to make it work more easily on Windows. Here are the charts and code.
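The checkpoint call in question looks like this; saved in a script file, it pins every package the script uses to its CRAN version as of that date:

```r
library(checkpoint)
# Install and load package versions exactly as they were on CRAN
# on 2016-08-22, so the script keeps running identically later
checkpoint("2016-08-22")
```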
It’s really easy to get basic visualizations within R, and these are better than basic visualizations.
Setting an external resource pool for execution of R commands using sp_execute_external_script has proven extremely useful, especially in cases where you have other workloads present, when you don’t want to overdo it on data analysis and take too many resources from others (especially when running data analysis in a production environment), or when you know that your data analysis will require maximum CPU and memory for a specific period. In such cases, using and defining an external resource pool is not only useful but highly recommended.
Resource Governor is a feature that enables you to manage SQL Server workloads and set limits on their system resource consumption. Limits can be configured for any workload in terms of CPU, memory, and I/O consumption. Where you have many different workloads on the same SQL Server, Resource Governor helps allocate requested resources.
If you’re concerned about R soaking up all of your server’s memory, Resource Governor is a great way of limiting that risk.
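As a rough sketch, capping external R scripts looks something like the T-SQL below; the pool name and percentages are illustrative, not values from the post:

```sql
-- Create a pool limiting external scripts (R via sp_execute_external_script)
-- to at most 40% of CPU and 30% of memory; values are illustrative
CREATE EXTERNAL RESOURCE POOL ep_rscripts
WITH (MAX_CPU_PERCENT = 40, MAX_MEMORY_PERCENT = 30);

-- Route the default workload group's external scripts to the new pool
ALTER WORKLOAD GROUP [default]
USING [default], EXTERNAL ep_rscripts;

ALTER RESOURCE GOVERNOR RECONFIGURE;
```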
When referring to what can be done in iOS, Apple often says that there is an “app” for that. Likewise, when R developers refer to what can be done in R, we often say that there is a “package” for that. For instance:
· If one needs to scrape data from the web, there are packages for that (rvest, RCurl, and others)
· If one needs to make complicated transformations to their data, there are packages for that (dplyr, tidyr, lubridate, stringr, and others)
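In that spirit, a tiny illustrative transformation with dplyr and tidyr (using the built-in mtcars data, not an example from the article):

```r
library(dplyr)
library(tidyr)

# Keep two measures per car, then reshape from wide to long
mtcars %>%
  mutate(car = rownames(mtcars)) %>%
  select(car, mpg, hp) %>%
  gather(metric, value, mpg, hp) %>%
  head()
```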
I like the F#-ness of M, but I admit that I’m happy there’s some fairly close R integration within Power BI, as that means there’s one fewer language I need to learn right now…
According to the results of the 2016 survey, R is the preferred tool for 42% of analytics professionals, followed by SAS at 39% and Python at 20%. While Python’s placing may at first appear to relegate the language to Bronze Medal status, it’s the delta here that really matters.
It’s interesting to see the breakdowns of who uses which language, comparing across industry, education, work experience, and geographic lines.
The result in this case will be successful, with correct R results, and sp_execute_external_script will not return an error for missing libraries.
I added a “fake” library called test123 for testing purposes, to check whether all the libraries will be installed successfully.
At the end, the script generates an xp_cmdshell command (on one line).
This is a rather clever solution to a problem which I’d rather not exist. There really ought to be a better way for authorized users to install packages programmatically.
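A hedged sketch of the library check described above; the library names, including the deliberately fake test123, are illustrative:

```sql
-- Ask the R runtime which of a list of libraries are missing;
-- test123 is fake, so it should always be reported as missing
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        libs <- c("ggplot2", "dplyr", "test123")
        missing <- libs[!(libs %in% rownames(installed.packages()))]
        OutputDataSet <- data.frame(MissingLibrary = missing)'
WITH RESULT SETS ((MissingLibrary NVARCHAR(128)));
```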
focus() works similarly to select() from the dplyr package (which is loaded along with the corrr package). You add the names of the columns you wish to keep in your correlation data frame. focus() will then remove the remaining column variables from the rows. This is why mpg does not appear in the rows above. Here’s another example with two variables:
Click through for the entire article.
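To illustrate the pattern, a minimal sketch of focus() using mtcars (the dataset used throughout the corrr examples):

```r
library(corrr)

# Correlation data frame of mtcars, keeping only two column variables;
# mpg and disp then disappear from the rows, per the behaviour above
correlate(mtcars) %>%
  focus(mpg, disp)
```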
The house edge is 2.70%. On average a gambler would lose 2.7% of his stake per game. Of course, on any one game he would either win or lose, but this is the long term expectation. Another way of looking at this is to say that the Return To Player (RTP) is 97.3%, which means that on average a gambler would get back 97.3% of his stake on every game.
Below are the results of a simulation of 100 gamblers betting on even numbers. Each starts with an initial capital of 100. The red line represents the average for the cohort. After 1000 games two gamblers have lost all of their money. Of the remaining 98 players, only 24 have made money while the rest have lost some portion of their initial capital.
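A minimal sketch of that simulation for European roulette (numbers 0 to 36, an even-money bet on even numbers, so the win probability is 18/37); this is my own illustration, not the post's code:

```r
set.seed(1)

n_gamblers <- 100
n_games    <- 1000

# Each game: win +1 with probability 18/37, else lose the unit stake.
# Rows are games, columns are gamblers.
outcomes <- matrix(
  sample(c(1, -1), n_gamblers * n_games, replace = TRUE,
         prob = c(18 / 37, 19 / 37)),
  nrow = n_games
)

# Capital over time from an initial 100 (for simplicity this ignores
# ruin; the post stops a gambler once their capital reaches zero)
capital <- 100 + apply(outcomes, 2, cumsum)

# Expected loss per unit stake per game: 1/37, i.e. about 2.70%
mean(outcomes)
```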
This is a very interesting article if you enjoy basic statistics. 13-year-old Onion article of note.