Bar Plot Alternatives

Alboukadel Kassambara shows off a couple alternatives to bar charts:

Cleveland’s dot plot

Color y text by groups. Use y.text.col = TRUE.

ggdotchart(dfm, x = "name", y = "mpg", color = "cyl", # Color by groups palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette sorting = "descending", # Sort value in descending order rotate = TRUE, # Rotate vertically dot.size = 2, # Large dot size y.text.col = TRUE, # Color y text by groups ggtheme = theme_pubr() # ggplot2 theme )+ theme_cleveland() # Add dashed grids

I like the lollipop chart example.


Hadley Wickham announces dbplyr version 1.1.0:

Since you’ve read this far, I also wanted to touch on RStudio’s vision for databases. Many analysts have most of their data in databases, and making it as easy as possible to get data out of the database and into R makes a huge difference. Thanks to the community, R already has strong tools for talking to the popular open source databases. But support for connecting to enterprise databases and solving enterprise challenges has lagged somewhat. At RStudio we are actively working to solve these problems.

As well as dbplyr and DBI, we are working on many other pain points in the database ecosystem. You’ll hear much more about these packages in the future, but I wanted to touch on the highlights so you can see where we are heading. These pieces are not yet as integrated as they should be, but they are valuable by themselves, and we will continue to work to make a seamless database experience, that is as good as (or better than!) any other environment.

There’s some very interesting vision talk at the end, showing how Wickham and the RStudio group are dedicated to enterprise-grade R.

Building Graph Tables

Tomaz Kastrun uses a set of e-mails as his SQL Server 2017 graph table data source:

To put the graph database to the test, I took bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting,  only FROM and SUBJECTS were extracted:

for index, message in enumerate(mailbox.mbox(infile)): content = get_content(message) row = [ message['from'].strip('>').split('<')[-1], decode_header(message['subject'])[0][0],"|" ] writer.writerow(row)

This post walks you through loading data, mostly.  But at the end, you can see how easy it is to find who replied to whose e-mails.

Terminating Errors In Powershell

Adam Bertram explains terminating versus non-terminating errors in Powershell:

Non-terminating errors are still “errors” in PowerShell but not quite as severe as terminating ones. Non-terminating errors aren’t as serious because they do not halt script execution. Moreover, you can silence them, unlike terminating errors. You can create non-terminating errors with the Write-Error cmdlet. This cmdlet writes text to the error stream.

You can also manipulate non-terminating errors with the common ErrorAction and ErrorVariable parameters on all cmdlets and advanced functions. For example, if you’ve created an advanced function that contains a Write-Error reference, you can temporarily silence this as shown below.

Adam also shows how to convert a non-terminating error into a terminating error in your script.

Minimize Updates

Lukas Eder shows the importance of minimizing the scope of update statements:

Optionally, just as with JPA, you can turn on optimistic locking on this statement. The important thing here is that the clicks and purchases columns are left untouched, because they were not changed by the client code. This is different from JPA, which either sends all the values by default, or if you specify @DynamicUpdate in Hibernate, it would send only the last_name column, because while first_name was changed it was not modified.

My definition:

  • changed: The value is “touched”, its state is “dirty” and the state needs to be synched to the database, regardless of modification.
  • modified: The value is different from its previously known value. By necessity, a modified value is always changed.

As you can see, these are different things, and it is quite hard for a JPA-based API like Hibernate to implement changed semantics because of the annotation-based declarative nature of how entities are defined. We’d need some sophisticated instrumentation to intercept all data changes even when the values have not been modified (I didn’t make those attributes public by accident).

I found this an interesting walkthrough of data layer-level mechanisms that directly affect database performance.

Winnowing Down WhoIsActive Data

Kendra Little shows how to use temporary objects to pare down the results of sp_whoisactive for later storage:

I used the @schema parameter to have sp_WhoIsActive generate the schema for the table itself. Full instructions on doing this by Adam are here.

Since I care about tempdb in the case of this example, I used @output_column_list to specify that those columns should come first, followed by the rest of the columns.

I also elected to set @get_plans to 1 to get query execution plans if they’re available. That’s not free, and they can take up a lot of room, but they can contain a lot of helpful info.

This is a very useful guide, and also read the linked documentation for sp_whoisactive; there’s a huge amount of goodness in that one procedure.

BimlExpress’s Preview Pane

Ben Weissman shows off the preview pane in BimlExpress 2017:

Once you’ve installed BimlExpress 2017 and open your first Biml file, you will probably notice immediately, that the screen is split horizontally – that is exactly for the preview pane.

If you need more real estate for your actual code, just click the „Hide“ button at the lower left corner.
To actually get a preview, click the „Update“ button on the lower right:

That’s a pretty nice feature.  It can be hard sometimes to debug Biml issues because you’re often writing code to write code to write code.

Creating Docker Volumes

Andrew Pruski continues his series on long-term data storage and Docker:

Awesome stuff! We’ve got a database that was created in another container successfully attached into another one.

So at this point you may be wondering what the advantage is of doing this over mounting folders from the host? Well, to be honest, I really can’t see what the advantages are.

The volume is completely contained within the docker ecosystem so if anything happens to the docker install, we’ve lost the data. OK, OK, I know it’s in C:\ProgramData\docker\volumes\ on the host but still I’d prefer to have more control over its location.

It’s worth reading the whole thing, even though this isn’t the best way to keep data long-term.  It’s important to know about this strategy even if only to keep it from accidentally ruining your day later.

Neural Nets On Spark

Nisha Muktewar and Seth Hendrickson show how to use Deeplearning4j to build deep learning models on Hadoop and Spark:

Modern convolutional networks can have several hundred million parameters. One of the top-performing neural networks in the Large Scale Visual Recognition Challenge (also known as “ImageNet”), has 140 million parameters to train! These networks not only take a lot of compute and storage resources (even with a cluster of GPUs, they can take weeks to train), but also require a lot of data. With only 30000 images, it is not practical to train such a complex model on Caltech-256 as there are not enough examples to adequately learn so many parameters. Instead, it is better to employ a method called transfer learning, which involves taking a pre-trained model and repurposing it for other use cases. Transfer learning can also greatly reduce the computational burden and remove the need for large swaths of specialized compute resources like GPUs.

It is possible to repurpose these models because convolutional neural networks tend to learn very general features when trained on image datasets, and this type of feature learning is often useful on other image datasets. For example, a network trained on ImageNet is likely to have learned how to recognize shapes, facial features, patterns, text, and so on, which will no doubt be useful for the Caltech-256 dataset.

This is a longer post, but on an extremely interesting topic.

Running H2O In R On Azure HDInsight

Daisy Deng shows how to configure HDInsight to be able to run the H2O package in R rather than Python or Scala:

We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for header node:

And run the following action for the worker node:

Please consult Customize Linux-based HDInsight clusters using Script Action for more details.

Click through for the full process.


June 2017
« May