Day: June 29, 2017

Bar Plot Alternatives

Alboukadel Kassambara shows off a couple of alternatives to bar charts:

Cleveland’s dot plot

Color y text by groups. Use y.text.col = TRUE.

ggdotchart(dfm, x = "name", y = "mpg",
           color = "cyl",                                # Color by groups
           palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
           sorting = "descending",                       # Sort value in descending order
           rotate = TRUE,                                # Rotate vertically
           dot.size = 2,                                 # Large dot size
           y.text.col = TRUE,                            # Color y text by groups
           ggtheme = theme_pubr()                        # ggplot2 theme
           ) +
  theme_cleveland()                                      # Add dashed grids

I like the lollipop chart example.

dbplyr

Hadley Wickham announces dbplyr version 1.1.0:

Since you’ve read this far, I also wanted to touch on RStudio’s vision for databases. Many analysts have most of their data in databases, and making it as easy as possible to get data out of the database and into R makes a huge difference. Thanks to the community, R already has strong tools for talking to the popular open source databases. But support for connecting to enterprise databases and solving enterprise challenges has lagged somewhat. At RStudio we are actively working to solve these problems.

As well as dbplyr and DBI, we are working on many other pain points in the database ecosystem. You’ll hear much more about these packages in the future, but I wanted to touch on the highlights so you can see where we are heading. These pieces are not yet as integrated as they should be, but they are valuable by themselves, and we will continue to work to make a seamless database experience, that is as good as (or better than!) any other environment.

There’s some very interesting vision talk at the end, showing how Wickham and the RStudio group are dedicated to enterprise-grade R.

Building Graph Tables

Tomaz Kastrun uses a set of e-mails as his SQL Server 2017 graph table data source:

To put the graph database to the test, I took a bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting, only FROM and SUBJECTS were extracted:

writer.writerow(['from','subject'])
for index, message in enumerate(mailbox.mbox(infile)):
    content = get_content(message)
    row = [
        message['from'].strip('>').split('<')[-1],
        decode_header(message['subject'])[0][0],"|"
    ]
    writer.writerow(row)

This post mostly walks you through loading data, but at the end you can see how easy it is to find who replied to whose e-mails.
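The excerpt leans on variables defined elsewhere in Tomaz's script (writer, infile, get_content), so here is a minimal self-contained sketch of the same idea using only the Python standard library. It is not his script: the function names decode_subject and extract_from_and_subject are hypothetical, and I've swapped his strip/split on angle brackets for email.utils.parseaddr, which I'm assuming is acceptable for typical "Name <address>" headers.

```python
import csv
import mailbox
from email.header import decode_header
from email.utils import parseaddr


def decode_subject(raw_subject):
    """Turn a possibly RFC 2047-encoded subject header into plain text."""
    if raw_subject is None:
        return ""
    parts = []
    for text, charset in decode_header(str(raw_subject)):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "utf-8", errors="replace")
            except LookupError:
                # Unknown charset label: fall back to UTF-8 (an assumption).
                text = text.decode("utf-8", errors="replace")
        parts.append(text)
    return "".join(parts)


def extract_from_and_subject(mbox_path, csv_path):
    """Write one (from, subject) row per message; return the message count."""
    count = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["from", "subject"])
            for message in box:
                # parseaddr splits 'Name <addr>' into (name, addr).
                _, address = parseaddr(str(message["from"] or ""))
                writer.writerow([address, decode_subject(message["subject"])])
                count += 1
    finally:
        box.close()
    return count
```

Calling extract_from_and_subject("list.mbox", "emails.csv") (both paths hypothetical) would produce a two-column CSV that could then be bulk-loaded into a staging table before building the graph tables.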

Terminating Errors In PowerShell

Adam Bertram explains terminating versus non-terminating errors in PowerShell:

Non-terminating errors are still “errors” in PowerShell but not quite as severe as terminating ones. Non-terminating errors aren’t as serious because they do not halt script execution. Moreover, you can silence them, unlike terminating errors. You can create non-terminating errors with the Write-Error cmdlet. This cmdlet writes text to the error stream.

You can also manipulate non-terminating errors with the common ErrorAction and ErrorVariable parameters on all cmdlets and advanced functions. For example, if you’ve created an advanced function that contains a Write-Error reference, you can temporarily silence this as shown below.

Adam also shows how to convert a non-terminating error into a terminating error in your script.

Minimize Updates

Lukas Eder shows the importance of minimizing the scope of update statements:

Optionally, just as with JPA, you can turn on optimistic locking on this statement. The important thing here is that the clicks and purchases columns are left untouched, because they were not changed by the client code. This is different from JPA, which either sends all the values by default, or if you specify @DynamicUpdate in Hibernate, it would send only the last_name column, because while first_name was changed it was not modified.

My definition:

  • changed: The value is “touched”, its state is “dirty” and the state needs to be synched to the database, regardless of modification.
  • modified: The value is different from its previously known value. By necessity, a modified value is always changed.

As you can see, these are different things, and it is quite hard for a JPA-based API like Hibernate to implement changed semantics because of the annotation-based declarative nature of how entities are defined. We’d need some sophisticated instrumentation to intercept all data changes even when the values have not been modified (I didn’t make those attributes public by accident).

I found this an interesting walkthrough of data layer-level mechanisms that directly affect database performance.
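To make the changed versus modified distinction concrete, here is a small Python sketch of a record that tracks both. It is an illustration only, not jOOQ's or Hibernate's API: the TrackedRecord class, its methods, and the customer table name are all hypothetical, and the column names are borrowed from the excerpt.

```python
class TrackedRecord:
    """Hypothetical record that remembers which columns were assigned."""

    def __init__(self, table, key_column, values):
        # Table and column names are trusted identifiers in this sketch.
        self.table = table
        self.key_column = key_column
        self._original = dict(values)  # last known database state
        self._current = dict(values)
        self._touched = []             # columns assigned to, in order

    def set(self, column, value):
        if column == self.key_column:
            raise ValueError("the key column is not updatable in this sketch")
        if column not in self._current:
            raise KeyError(column)
        self._current[column] = value
        if column not in self._touched:
            self._touched.append(column)

    def changed_columns(self):
        """Columns that were assigned, whether or not the value differs."""
        return list(self._touched)

    def modified_columns(self):
        """Assigned columns whose value differs from the loaded value."""
        return [c for c in self._touched
                if self._current[c] != self._original[c]]

    def update_statement(self, columns):
        """Build a parameterized UPDATE limited to the given columns."""
        if not columns:
            return None
        assignments = ", ".join(f"{c} = ?" for c in columns)
        sql = (f"UPDATE {self.table} SET {assignments} "
               f"WHERE {self.key_column} = ?")
        params = [self._current[c] for c in columns]
        params.append(self._current[self.key_column])
        return sql, params


record = TrackedRecord("customer", "id", {
    "id": 1, "first_name": "John", "last_name": "Doe",
    "clicks": 10, "purchases": 2,
})
record.set("first_name", "John")   # changed, but not modified
record.set("last_name", "Smith")   # changed and modified

changed_sql, changed_params = record.update_statement(record.changed_columns())
modified_sql, modified_params = record.update_statement(record.modified_columns())
```

With changed semantics the statement sets first_name and last_name; with modified semantics it sets only last_name. Either way, clicks and purchases never appear in the SET clause, which is the point of the post.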

Winnowing Down WhoIsActive Data

Kendra Little shows how to use temporary objects to pare down the results of sp_whoisactive for later storage:

I used the @schema parameter to have sp_WhoIsActive generate the schema for the table itself. Full instructions on doing this by Adam are here.

Since I care about tempdb in the case of this example, I used @output_column_list to specify that those columns should come first, followed by the rest of the columns.

I also elected to set @get_plans to 1 to get query execution plans if they’re available. That’s not free, and they can take up a lot of room, but they can contain a lot of helpful info.

This is a very useful guide. Also read the linked documentation for sp_whoisactive; there’s a huge amount of goodness in that one procedure.

BimlExpress’s Preview Pane

Ben Weissman shows off the preview pane in BimlExpress 2017:

Once you’ve installed BimlExpress 2017 and open your first Biml file, you will probably notice immediately, that the screen is split horizontally – that is exactly for the preview pane.

If you need more real estate for your actual code, just click the „Hide“ button at the lower left corner.
To actually get a preview, click the „Update“ button on the lower right:

That’s a pretty nice feature. It can sometimes be hard to debug Biml issues because you’re often writing code to write code to write code.

Creating Docker Volumes

Andrew Pruski continues his series on long-term data storage and Docker:

Awesome stuff! We’ve got a database that was created in another container successfully attached into another one.

So at this point you may be wondering what the advantage is of doing this over mounting folders from the host? Well, to be honest, I really can’t see what the advantages are.

The volume is completely contained within the docker ecosystem so if anything happens to the docker install, we’ve lost the data. OK, OK, I know it’s in C:\ProgramData\docker\volumes\ on the host but still I’d prefer to have more control over its location.

It’s worth reading the whole thing, even though this isn’t the best way to keep data long-term.  It’s important to know about this strategy even if only to keep it from accidentally ruining your day later.
