Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.
When working with spatial data, it’s important to differentiate an effect you see because it’s actually unique or interesting from an effect you see because that’s where all of the people are.
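To make that point concrete, here is a minimal sketch of using station entries as a busyness proxy. The station names and every number below are invented for illustration and do not come from the NSW open data portal; the idea is only that ranking by raw counts and ranking by a rate per 10,000 entries can tell different stories.

```python
# Hypothetical figures, invented for illustration only.
stations = {
    "Station A": {"entries": 80000, "incidents": 40},
    "Station B": {"entries": 5000, "incidents": 10},
    "Station C": {"entries": 20000, "incidents": 12},
}

def rate_per_10k(incidents, entries):
    """Incidents per 10,000 station entries (entries act as the busyness proxy)."""
    return incidents / entries * 10_000

# Ranking by raw counts mostly reflects where the people are.
raw_rank = sorted(stations, key=lambda s: stations[s]["incidents"], reverse=True)

# Ranking by rate controls for busyness.
rate_rank = sorted(stations, key=lambda s: rate_per_10k(**stations[s]), reverse=True)

print("By raw count:", raw_rank)
print("By rate:     ", rate_rank)
```

In this made-up data the busiest station tops the raw ranking, while the quietest station tops the rate ranking.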
You can easily see how arbitrary shapes can be almost magically discovered through the principle of nearest-neighbor search.
The magic happens because the methodical approach of meeting and greeting the neighbors discovers more and more neighbors (and hence the visualization becomes denser and denser) as per the formation of the shape, and on the other hand, sparser and sparser as the traversal approaches the contours of those very shapes. The sparseness around the dense shapes provides the much-needed contrast to discover hidden shapes.
Read on for a very interesting explanation.
If there is one thing of general utility lacking in ggplot2 it is probably the ability to annotate data cleanly. Sure, there’s geom_label(), but using it requires a fair bit of fiddling to get the best placement and, further, it is mainly relevant for labeling and not longer text. ggrepel has improved immensely on the fiddling part, but the lack of support for longer text annotation as well as annotating whole areas is still an issue.
In order to at least partly address this, ggforce includes a family of geoms under the geom_mark_*() moniker. They all behave equivalently except for how they encircle the given area(s).
There are some really interesting features in the ggforce package, so check them out.
The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.
That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.
The example is small but important.
Let’s say you’ve developed a predictive model in R, and you want to embed predictions (scores) from that model into another application (like a mobile or Web app, or some automated service). If you expect a heavy load of requests, R running on a single server isn’t going to cut it: you’ll need some kind of distributed architecture with enough servers to handle the volume of requests in real time.
This reference architecture for real-time scoring with R, published in Microsoft Docs, describes a Kubernetes-based system to distribute the load to R sessions running in containers.
Data manipulation and aggregation is one of the classic tasks anyone working with data will come across. We of course can perform data transformation and aggregation with base R, but when speed and memory efficiency come into play, data.table is my package of choice.
In this post we will look at one of the fresh and very useful features that came to data.table only last year: grouping sets, enabling us, for example, to create pivot table-like reports with sub-totals and grand total quickly and easily.
Grouping sets are also available in SQL dialects and are something people tend not to be aware of. This is a shame because they’re quite powerful, and Jozef shows how powerful they can be in R.
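For anyone who has not run into grouping sets before, here is a rough sketch of the idea in plain Python. This is not data.table's implementation or API (data.table exposes this through functions such as groupingsets(), rollup() and cube()); the grouping_sets helper and the sales rows are hypothetical, made up to show how one pass per grouping set yields detail rows, sub-totals and a grand total in a single report.

```python
from collections import defaultdict

# Hypothetical sales rows, invented for illustration.
rows = [
    {"region": "East", "product": "A", "amount": 10},
    {"region": "East", "product": "B", "amount": 20},
    {"region": "West", "product": "A", "amount": 5},
    {"region": "West", "product": "B", "amount": 15},
]

def grouping_sets(rows, value, sets):
    """Sum `value` once per grouping set; columns outside a set are reported as None."""
    # Collect every grouping column, in first-seen order, so all records share one shape.
    all_cols = []
    for cols in sets:
        for c in cols:
            if c not in all_cols:
                all_cols.append(c)

    out = []
    for cols in sets:
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[c] for c in cols)  # () for the grand total
            totals[key] += row[value]
        for key, total in totals.items():
            record = {c: None for c in all_cols}
            record.update(zip(cols, key))
            record[value] = total
            out.append(record)
    return out

# Detail rows, per-region sub-totals, then the grand total (the empty set).
report = grouping_sets(rows, "amount", [("region", "product"), ("region",), ()])
for r in report:
    print(r)
```

A None in a grouping column marks a row that aggregates over that column, which is how sub-total and grand-total rows are usually distinguished from detail rows.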
After some searching, I found that there are a few packages in R which allow making exemplary flowcharts. The one which I found easy to use was DiagrammeR. The advantage of this package is that it generates diagrams using code within R Markdown syntax.
The taped-glasses nerd in me wants to point out that flow charts use geometric shapes to show flow and that this is more properly labeled a graph (the examples are directed acyclic graphs), but hush, taped-glasses nerd self.
A useful way of dealing with outliers is by running a robust regression, or a regression that adjusts the weights assigned to each observation in order to reduce the skew resulting from the outliers.
In this particular example, we will build a regression to analyse internet usage in megabytes across different observations. You will see that we have several outliers in this dataset. Specifically, we have three incidences where internet consumption is vastly higher than other observations in the dataset.
Let’s see how we can use a robust regression to mitigate the effect of these outliers.
Click through for a demonstration.
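To give a feel for the "adjusts the weights" part, here is a small sketch of one common approach: iteratively reweighted least squares with Huber weights. It is written in plain Python rather than R, it is not the linked post's code, and the data are invented (a straight line with three large outliers, echoing the three mentioned above). The tuning constant 1.345 and the MAD-based scale are conventional choices that I am assuming here.

```python
from statistics import median

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def huber_fit(x, y, k=1.345, iterations=50):
    """Refit repeatedly, shrinking the weight of points with large residuals."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        intercept, slope = weighted_line(x, y, w)
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate: median absolute residual, rescaled (assumed convention).
        scale = median(abs(r) for r in resid) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 inside the threshold, decaying with |residual| outside it.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return intercept, slope

# Invented data: y = 1 + 2x plus small deterministic noise, with three big outliers.
x = list(range(1, 16))
y = [1 + 2 * xi + ((i * 7) % 5 - 2) * 0.1 for i, xi in enumerate(x)]
for i, bump in zip((3, 8, 13), (60, 70, 80)):
    y[i] += bump

ols_intercept, ols_slope = weighted_line(x, y, [1.0] * len(x))
rob_intercept, rob_slope = huber_fit(x, y)
print("OLS slope:   ", round(ols_slope, 3))
print("Robust slope:", round(rob_slope, 3))
```

With equal weights the three outliers pull the slope well away from 2, while the reweighted fit stays close to the line the remaining points follow.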
2. feather instead of csv
The app relied on some pre-wrangled csv files; these have been replaced by files stored using the .feather format, which makes for a significant performance improvement.
Martin has made a significant number of changes and it’s cool to see the full list of changes. H/T R-bloggers
If I type the letter a into the R Script editor, my code completion options are acts, always, and, and as. Power BI’s editor is not offering any IntelliSense options from a Python or R dictionary. Instead, it’s pulling from the text already in the editor. Note the comment in Line 1 and the inclusion of words beginning with the letter a — always, and, acts, as.
By comparison, the DAX editor contains a detailed function list and helpful annotations for code completion. Can we get something similar for R and Python? Not exactly… But there’s a workaround that I’m almost embarrassed to suggest. If you are a user who codes directly into the script editor, the following hack could be helpful. If you use the option to Edit script in External IDE, keep doing that and ignore the following guidance.
As-is, this is worse than no IntelliSense because at least with no IntelliSense, it’ll never steal a mouse click or keystroke. I wouldn’t expect RStudio level quality out of the gate but unless I’m missing something, that’s pretty bad.