If missing values are something that haunts you, then the MICE package is a real friend of yours.
When we face missing values, we generally go with basic imputations such as replacing them with 0, the mean, or the mode, but none of these methods is versatile, and each can introduce data discrepancies.
The MICE package helps you impute missing values using multiple techniques, depending on the kind of data you are working with.
I’d heard of a couple of these, but most of them are new to me.
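For the curious, here's a minimal sketch of the package in action, using the nhanes demo data set that ships with mice (the method and seed below are just illustrative choices):

```r
library(mice)

# Impute the nhanes demo data five times using predictive mean matching,
# then pull out the first completed data set.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123)
completed <- complete(imp, 1)
head(completed)
```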
By analyzing the plot above, we can arrive at the following insights:
The number of crimes steadily declines from midnight, reaching its lowest point during the early morning hours, then starts increasing and peaks around 6 PM. This is the same insight we arrived at in my previous analysis, but here we have categorized by police district and still see the same pattern.
As seen in the previous plot, Park and Richmond districts have the lowest number of crimes throughout the day.
As highlighted in red in the plot above, the maximum number of crimes occurs in the Southern district around 6 PM.
I would prefer to see code here, but it does serve to give you an idea of what R can do.
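In that spirit, here is a rough sketch of how one might build a plot like this with ggplot2, assuming a hypothetical crimes data frame with one row per incident and Hour and PdDistrict columns (as in the SFPD incident data):

```r
library(dplyr)
library(ggplot2)

# crimes: hypothetical data frame, one row per incident,
# with Hour (0-23) and PdDistrict columns.
crimes %>%
  count(PdDistrict, Hour) %>%
  ggplot(aes(x = Hour, y = n, colour = PdDistrict)) +
  geom_line() +
  labs(x = "Hour of day", y = "Number of incidents",
       title = "Incidents by hour and police district")
```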
As more settlements in Texas and France are impacted by severe flooding, this is a good time to thank the hydrologists at NOAA who forecast river level rises in advance, giving residents in affected areas time to move to higher ground. Along with topographic, rainfall, and weather data, monitoring stations maintained by NOAA and the USGS along rivers provide critical real-time information about river levels. NOAA scientists access this data using the dataRetrieval package for R, then incorporate it into flood prediction models and use it to generate animations like this one of the flooding of the Delaware in February of this year.
Looks like I’ve got a new blog to follow…
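If you'd like to poke at the underlying data yourself, pulling gage readings with dataRetrieval takes only a few lines. A sketch, with an illustrative site number (01463500, which should be the Delaware River at Trenton, NJ) and date range:

```r
library(dataRetrieval)

# Instantaneous gage height (USGS parameter code 00065) for one site;
# the site number and dates here are illustrative.
gage <- readNWISuv(siteNumbers = "01463500",
                   parameterCd = "00065",
                   startDate = "2016-02-20",
                   endDate = "2016-02-28")
head(gage)
```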
There are probably very few cases for which this is technically a good idea (trying to be a featured author on JunkCharts might very well be one of them). Nonetheless, there are at least a couple of requests for this floating around on Stack Overflow; here and here, for example. I struggled to find any satisfactory solutions in current working order (though perhaps my Google-fu has failed me).
Jonathan is rather against this idea, and it does seem like the answer is a hack. I suppose the real answer is “sometimes an image isn’t worth a thousand words.”
For those who haven't decided which tool to use, let me offer two compelling reasons to pick Visual Studio [VS] over R Studio: IntelliSense and improved debugging tools. R Studio does not have IntelliSense, and it is not possible to debug your code by stepping through it in the manner with which many VS developers are already quite familiar. You will need to configure VS to use R tools, as detailed below.
Those are nice features. I’m still a big fan of R Studio, but have seen big improvements in R Tools for Visual Studio, so I imagine I’ll make the switch by the end of the year.
Apache Zeppelin, a web-based notebook, enables interactive data analytics, including data ingestion, data discovery, and data visualization, all in one place. Zeppelin's interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters, such as Spark (Scala, Python, R, SparkSQL), Hive, JDBC, and others. Zeppelin can be configured with an existing Spark ecosystem and share a SparkContext across Scala, Python, and R.
This links to a rather long post on how to set up and use all of these pieces. I’m more familiar with Jupyter than Zeppelin, but regardless of the notebook you choose, this is a good exercise to become familiar with the process.
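To give a flavor of the shared-context idea, here is a hypothetical SparkR paragraph from a Zeppelin note (using the Spark 1.x SparkR API); it queries a temp table (events, my invention) that a Scala or Python paragraph in the same note might have registered:

```r
# %spark.r -- SparkR paragraph; it shares the note's SparkContext and
# sqlContext with the Scala and Python paragraphs, so the hypothetical
# "events" temp table registered elsewhere in the note is visible here.
df <- sql(sqlContext, "SELECT eventType, COUNT(*) AS n
                         FROM events
                         GROUP BY eventType")
head(df)
```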
In each case, a number of different models are trained in R (decision forests, boosted decision trees, multinomial models, neural networks, and Poisson regression) and compared for performance; the best model is automatically selected for predictions.
On a related note, Microsoft recently teamed up with aircraft engine manufacturer Rolls-Royce to help airlines get the most out of their engines. Rolls-Royce is turning to Microsoft’s Azure cloud-based services — Stream Analytics, Machine Learning, and Power BI — to make recommendations to airline executives on the most efficient ways to use their engines in flight and on the ground. This short video gives an overview.
Check out the data set and play around a bit.
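The train-several-candidates-and-keep-the-best pattern is easy to mimic on a small scale. A minimal sketch with two stand-in models (linear regression and a random forest) on a built-in data set:

```r
library(randomForest)

# Split a built-in data set into training and test sets.
set.seed(42)
idx   <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Train two candidate models and compare their test-set RMSE.
models <- list(
  lm = lm(mpg ~ ., data = train),
  rf = randomForest(mpg ~ ., data = train)
)
rmse <- sapply(models, function(m) sqrt(mean((predict(m, test) - test$mpg)^2)))

# Keep the best performer for making predictions.
best <- models[[which.min(rmse)]]
rmse
```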
The SQL Server 2016 release date is June 1, 2016. Among the new features everyone is talking about is “R Services”.
Below is a quick reference of links related to “R Services”.
Check it out.
Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the saveRDS/readRDS function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (they’re essentially a dump of R’s in-memory representation of the object), so you can’t read the data with any software other than R.
The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes instead of the 10 ASCII characters required by a CSV file) and fast to read and write (no need to convert numbers to text and back again). Another reason feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files, R must spend significant time translating the data from column format to row format and back again; with feather, that entire translation step is eliminated.
Given the big speedup in read time, I can see this file format being rather useful. I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support the file. So instead, it’d end up closer to something like Avro or Parquet: formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.
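Usage is about as simple as file I/O gets; the file names below are arbitrary:

```r
library(feather)

# Write a data frame to a feather file and read it back; the same file
# can also be read from Python via its feather package.
write_feather(mtcars, "mtcars.feather")
df <- read_feather("mtcars.feather")

# The R-only equivalent, via serialization to an .rds file:
saveRDS(mtcars, "mtcars.rds")
df2 <- readRDS("mtcars.rds")
```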
It’s not fast. The above piece of T-SQL took ~4 seconds to execute. This is on an Azure A3 VM, admittedly not a great machine, but the R code, which just returns the first six rows of a built-in data set, ran in under a second on my desktop. This is likely not something you’ll be doing as part of an OLTP process.
I hope this external_script method is temporary. It’s ugly, hard to troubleshoot, and it means I have to write my R somewhere else (probably R Studio, maybe Visual Studio) and move it over once tested and working. I’d much rather see a cleaner, first-class syntax for embedding R.
I agree about the sp_execute_external_script mess. It’s the worst of dynamic SQL combined with multiple languages (T-SQL for the stored procedure and R for the contents, while taking care to deal with T-SQL single-quoting). Still, even with these issues, I think this will be a very useful tool for data analysts, particularly when dealing with rather large data sets on warehouse servers with plenty of RAM.
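For reference, a minimal sp_execute_external_script call looks something like the following, mirroring the head-of-a-built-in-data-set example discussed above (the result set column definitions are my own):

```sql
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- head(iris);'
WITH RESULT SETS ((
    Sepal_Length FLOAT,
    Sepal_Width  FLOAT,
    Petal_Length FLOAT,
    Petal_Width  FLOAT,
    Species      VARCHAR(30)
));
```

Note the single-quoting pain mentioned above: the entire R script lives inside an N'...' string, so any single quotes within the R code itself have to be doubled up.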