Press "Enter" to skip to content

Author: Kevin Feasel

Visualizing Cholesterol Data With ggplot2

Anisa Dhana uses the National Health and Nutrition Examination Survey and visualizes results with ggplot2:

From the plots above I find that, regardless of the different levels of diastolic and systolic blood pressure, there is no substantial correlation between cholesterol and blood pressure. However, it is better to build the correlation line with geom_smooth or to calculate the Spearman correlation, although in this post we focus only on the visualization.

Let's build the correlation line.

Click through for several examples of visuals.
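As a rough sketch of that correlation line, something like the following would work; note that nhanes, chol, and dbp are hypothetical names standing in for the NHANES-derived data frame and columns in Anisa's post:

  library(ggplot2)

  # Scatter plot of cholesterol against diastolic blood pressure,
  # with a fitted line from geom_smooth (hypothetical column names)
  ggplot(nhanes, aes(x = chol, y = dbp)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm")

  # Spearman correlation as a numeric check on the visual impression
  cor(nhanes$chol, nhanes$dbp, method = "spearman", use = "complete.obs")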


Microsoft + R

David Smith points out a bunch of the ways that Microsoft integrates R into products:

You can call R from within some data-oriented Microsoft products, and apply R functions (from base R, from packages, or functions you’ve written) to the data they contain.

  • SQL Server (the database) allows you to call R from SQL, or publish R functions to a SQL Server for database administrators to use from SQL.

  • Power BI (the reporting and visualization tool) allows you to call R functions to process data, create graphics, or apply statistical models to data.

  • Visual Studio (the integrated development environment) includes R as a fully-supported language with syntax highlighting, debugging, etc.

  • R is supported in various cloud-based services in Azure, including the Data Science Virtual Machine and Azure Machine Learning Studio. You can also publish R functions to Azure with the AzureML package, and then call those R functions from applications like Excel or apps you write yourself.

They’re pretty well invested in both R and Python, which is a good thing.
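As an example of that last piece, here is a minimal sketch of publishing an R function with the AzureML package; the workspace ID, authorization key, and the trivial add function are all placeholders:

  library(AzureML)

  # Connect to an Azure Machine Learning Studio workspace (placeholder credentials)
  ws <- workspace(id = "your-workspace-id", auth = "your-authorization-key")

  # A trivial function to expose as a web service
  add <- function(x, y) x + y

  # Publish it; Excel or a custom application can then call the resulting endpoint
  api <- publishWebService(ws, fun = add, name = "AddService",
                           inputSchema = list(x = "numeric", y = "numeric"),
                           outputSchema = list(ans = "numeric"))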


Using cowplot With ggplot2

I have a post on extending ggplot2’s functionality with cowplot:

Notice that I used geom_path().  This is a geom I did not cover earlier in the series.  It’s not a common geom, though it does show up in charts like this where we want to display data for three variables.  The geom_line() geom follows the basic rules for a line:  that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).

But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in.  The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable.  The trick is, though, that we don’t define this third variable—it’s implicit in the data set order.  In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp).  Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set.  So let’s look at a place where that’s not true.
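To see the difference in code, here is a minimal sketch; it assumes the global_avg summary data frame from the post, with year, m_lifeExp, and a GDP-per-capita column I'm calling m_gdpPercap:

  library(ggplot2)
  library(dplyr)

  # geom_line(): y is treated as a function of x, so points are
  # connected in order of the x values
  ggplot(global_avg, aes(x = m_gdpPercap, y = m_lifeExp)) +
    geom_line()

  # geom_path(): points are connected in the order they appear in the
  # data, so the implicit third variable is the row order (here, year)
  ggplot(arrange(global_avg, year), aes(x = m_gdpPercap, y = m_lifeExp)) +
    geom_path()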

The cowplot library gives you a way to link together different plots of different sizes in just a couple of lines of code, which is much easier than doing it with ggplot2 alone.
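For instance, a minimal sketch with cowplot's plot_grid() might look like this, where p1 and p2 stand in for any two existing ggplot objects:

  library(cowplot)

  # Combine two ggplot objects into a single figure, giving the first
  # panel twice the width of the second and labeling each panel
  plot_grid(p1, p2, rel_widths = c(2, 1), labels = c("A", "B"))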


Web Scraping With Power BI

Imke Feldmann shows how to use Power BI to scrape multiple tables from a webpage:

I will present 2 methods here:

  1. Append-method: This is the obvious one and is fast for just a few tables.
  2. Add-Column-method: A bit more complicated but will be faster for a large number of tables and is also suitable for a dynamic number of tables.

You will also find 2 options at the end of this article:

  1. Use custom functions for multi-step table transformations

  2. Use dynamic filters to select the desired tables

Read the whole thing.


Using Date Types In Warehouses

Koen Verbeeck argues that date keys in warehouses should be actual date types:

The worst is by far the string representation, as there is no actual check on the contents. It can literally contain anything. And is '01/02/2018' the first of February 2018 (like any sane person would read it, because days come before months), or the 2nd of January? So if you have to store dates in your data warehouse, avoid strings at all costs. No excuses.

The integer representation – e.g. 20171208 – is really popular. If I recall Kimball correctly, he said it’s the one exception where you can use smart keys, aka surrogate keys that have a meaning embedded into them. I used them for quite some time, but I believe I have found a better alternative: using the actual date data type.

I bounce back and forth, but I’m sympathetic to Koen’s argument, which you can read by clicking through.


An Introduction To Splunk

Victoria Holt has some basics on Splunk:

Splunk, a software platform, has the capability to leverage machine data for data management and analytics.  It can be used for

  • Data-driven decision making
  • Alerting on network security threats
  • Reporting on system failures
  • Analysing and improving functionality

It enables performance analysis, dashboard creation, monitoring, troubleshooting, and investigation of the real-time data collected. An Edureka learning video showed the Splunk components.

Advanced Splunk queries are still a bit like magic to me, but this is a very powerful service once you get a handle on how it works.


Appropriate Data Types And Unicode

Raul Gonzalez on (in)appropriate use of National character strings:

Yes, you have read it… I see dates stored as NVARCHAR(10) and NCHAR(10) on a daily basis; please don’t ask me why.

This case is even worse, because DATE takes 3 bytes where NCHAR(10) takes 20 bytes. Yes, ladies and gentlemen, that is more than six times the space to store the same data.

But wait! How can you be certain that those ten characters are actually a valid date? You can’t, unless you reinvent the wheel, validate that those strings really are valid dates, and pay the performance penalty of doing so.

You’d think that picking the right data type for something would be fairly easy and then you find a table with a few dozen NVARCHAR(MAX) columns.


Database Mirroring And SQL Server Editions

Bob Pusateri points out something important about database mirroring:

A few weeks ago I had a reminder of one of the finer points of the requirements for mirroring: mirrored servers need to be running not only the same version of SQL Server, but the same edition as well.

I hadn’t thought of this in a while, but it makes sense. Asynchronous Database Mirroring (also known as “High-Performance Mode”) is only available in Enterprise Edition, while Standard Edition only supports “High Safety Mode”, which is synchronous. If the primary server in a mirroring topology was Enterprise Edition, but the mirror server was Standard Edition, how would that work?

Also check Glenn Berry’s comment for a trick which stretches mirroring’s capabilities for upgrades.


Investigating The gcForest Algorithm

William Vorhies describes a new algorithm with strong potential:

gcForest (multi-Grained Cascade Forest) is a decision tree ensemble approach in which the cascade structure of deep nets is retained but where the opaque edges and node neurons are replaced by groups of random forests paired with completely-random tree forests.  In this case, typically two of each for a total of four in each cascade layer.

Image and text problems are categorized as ‘feature learning’ or ‘representation learning’ problems where features are neither predefined nor engineered as in traditional ML problems.  And the basic rule in these feature discovery problems is to go deep, using multiple layers each of which learns relevant features of the data in order to classify them.  Hence the multi-layer structure so familiar with DNNs is retained.

By using both random forests and completely-random tree forests, the authors gain the advantage of diversity.  Each forest contains 500 completely random trees allowed to split until each leaf node contains only the same class of instances, making the growth self-limiting and adaptive, unlike the fixed and large depth required by DNNs.

The estimated class distribution forms a class vector which is then concatenated with the original feature vector to be the input of the next level cascade.  Not dissimilar from CNNs.

The final model is a cascade of cascade forests.  The final prediction is obtained by aggregating the class vectors and selecting the class with the highest maximum score.

Click through for how it fares on the normal sample data sets like MNIST.
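To make the cascade idea concrete, here is a toy sketch in R of a single cascade step; this is not the authors' gcForest implementation, and it assumes the randomForest package plus a data frame named train with a factor label column y. The completely-random forests are only approximated here by forcing mtry = 1:

  library(randomForest)

  X <- train[, setdiff(names(train), "y")]
  y <- train$y

  # Level 1: a standard random forest plus a (roughly) completely-random
  # forest, approximated by considering a single random feature per split
  rf1  <- randomForest(X, y, ntree = 500)
  crf1 <- randomForest(X, y, ntree = 500, mtry = 1)

  # Class vectors: out-of-bag class vote proportions from each forest
  cv_rf  <- rf1$votes;  colnames(cv_rf)  <- paste0("rf_",  colnames(cv_rf))
  cv_crf <- crf1$votes; colnames(cv_crf) <- paste0("crf_", colnames(cv_crf))

  # Concatenate the class vectors with the original features to form
  # the input for the next cascade level
  X2 <- cbind(X, cv_rf, cv_crf)

  # Level 2: another pair of forests on the augmented features
  rf2  <- randomForest(X2, y, ntree = 500)
  crf2 <- randomForest(X2, y, ntree = 500, mtry = 1)

  # Final prediction: average the last level's class vectors and take
  # the class with the highest score
  final_prob <- (rf2$votes + crf2$votes) / 2
  pred <- factor(colnames(final_prob)[max.col(final_prob)], levels = levels(y))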


Simple Data Transformation Tricks In R

Abdul Majed Raja has a few tidyverse-friendly data transformation tips:

Splitting one column into many columns is a classic data transformation task that comes up all the time. While it’s straightforward to do this in Microsoft Excel, it’s slightly trickier in data analytics languages. That was true until the separate() function from tidyr came along.
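Here is a minimal sketch of separate() on a made-up data frame (df and name_dept are placeholders, not from the original post):

  library(tidyr)

  # A made-up data frame with one combined column
  df <- data.frame(name_dept = c("Alice_Sales", "Bob_IT", "Carol_HR"))

  # Split the combined column into two columns at the underscore
  separate(df, name_dept, into = c("name", "dept"), sep = "_")
  # The result has separate name and dept columns ("Alice"/"Sales", and so on)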

These are small but helpful tips.
