Press "Enter" to skip to content

Curated SQL Posts

Dependencies as Risks

John Mount makes the point that package dependencies are innately a risk:

If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish the correctness of all of the dependencies. This is worse than having non-reproducible research, as your work may have in fact been wrong even the first time.

Low dependencies and low complexity dependencies can also be wrong, but in this case there at least exists the possibility of checking things or running down and fixing issues.

There are some insightful comments on this post as well, so check those out. This is definitely an area where there are trade-offs, so trying to reason through when to move in which direction is important.


Custom ggplot2 Fonts

Daniel Oehm shares two techniques for using custom fonts in your ggplot2 visuals:

ggplot – You can spot one from a mile away, which is great! And when you do it’s a silent fist bump. But sometimes you want more than the standard theme.

Fonts can breathe new life into your plots, helping to match the theme of your presentation, poster or report. This is always a second thought for me, and I need to work out how to do it again, hence the post.

There are two main packages for managing fonts – extrafont, and showtext.

Read on to see how to use each of these packages. H/T R-bloggers


Sending Highlighted Code From VS Code Via SSH

Anthony Nocentino shows how you can use Visual Studio Code to highlight and then send code via SSH to a remote machine:

You can create a custom keyboard shortcut in VS Code (And Azure Data Studio too) that gives you this functionality. Highlight code, press a button and execute that code in the active terminal, which just so happens to be SSH’d into a remote host.

Head over to Preferences->Keyboard Shortcuts (Picture 1) and in there you’ll find a shortcut called “Terminal: Run Selected Text In Active Terminal” (Picture 2). This is exactly what I want. Now, when I’m presenting…I can highlight the code…and what I highlighted gets copied into the terminal below and executed on whatever system is active in the terminal below. This could be either my local computer or a remote system over SSH.

Anthony’s use case is specifically around presentations, but it could also be good for general use.


SSIS on Windows Containers

Andy Leonard is a man who doesn’t like to take “no” for an answer:

Seriously, since I hopped the fence from developer to data I’ve dreamed of the day when I could practice lifecycle management with data-stuff like I used to practice lifecycle management with software development.

I recognize the obstacles. The greatest obstacle (in my humble opinion) is software is mostly stateless these days (these days started with Object-Oriented Programming and include its descendants).

Stateless development solves lots of engineering problems in lifecycle management, and by “solves a lot of engineering problems” I mean some engineering problems simply don’t exist so lifecycle management for stateless stuff can simply ignore a “lot of engineering problems.”

A database, on the other hand, is all about that state. When it comes to managing lifecycle for a stateful platform – like a database – ACID gets dumped on many lifecycle management tools and solutions (see what I did there?).

Is it possible to manage a data-related lifecycle using stateless tools? Yes. But here there be obstacles. Let’s look at one use case:

Click through for more thoughts and setup for a new series.


Retaining a Few Tables From a Large Set

Jana Sattainathan has a PowerShell-based solution to eliminate all but a few tables in a database:

Recently, I received a request to back up a dozen or so tables out of 12 thousand tables. I had to retain all the indexes, statistics, etc. The goal was to hand this over to the vendor for analysis as a database backup.

I could have copied the selected tables over to a new database using the PowerShell function I had published earlier and backed that up, but since the tables to back up were quite large, I skipped that route.

Read on to see Jana’s solution.


Flattening Dimensional Models

Reza Rad explains why it makes sense to build flat dimensional models, particularly for Power BI:

The article that I wrote earlier this week about the shared dimension had a lot of interest, and I’m glad it helped many of you. So I thought it better to write about the basics of modeling even more. In this article, I will be focusing on a scenario that you have all faced, though with different approaches. Is it good to have too many dimension tables? Can you combine some of those tables together to build one flattened dimension table? How much should you flatten it? Should you end up with one huge table including everything? In this article, I’m answering all of these questions and explaining the scenarios of combining dimensions. As usual, I explain the model in Power BI; however, the concepts are applicable to any other tools. If you’d like to learn more about Power BI, read the Power BI book from Rookie to Rock Star.

Given how closely the ideal Power BI data model matches the Kimball model, Reza’s advice makes perfect sense.
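To make the idea concrete outside of Power BI, here is a minimal sketch in pandas (the table and column names are hypothetical, not from Reza’s article) that collapses a snowflaked Product / Subcategory / Category chain into a single flattened product dimension:

```python
import pandas as pd

# Hypothetical snowflaked dimension tables
product = pd.DataFrame({
    "ProductKey": [1, 2, 3],
    "ProductName": ["Road Bike", "Helmet", "Water Bottle"],
    "SubcategoryKey": [10, 20, 30],
})
subcategory = pd.DataFrame({
    "SubcategoryKey": [10, 20, 30],
    "SubcategoryName": ["Bikes", "Safety", "Accessories"],
    "CategoryKey": [100, 200, 200],
})
category = pd.DataFrame({
    "CategoryKey": [100, 200],
    "CategoryName": ["Vehicles", "Gear"],
})

# Flatten the chain into one dimension table: one row per product,
# carrying the subcategory and category attributes along with it.
dim_product = (
    product
    .merge(subcategory, on="SubcategoryKey", how="left")
    .merge(category, on="CategoryKey", how="left")
    .drop(columns=["SubcategoryKey", "CategoryKey"])
)

print(dim_product)
```

The resulting table has one row per product with the subcategory and category attributes denormalized onto it, which is the flattened shape the article recommends for Power BI models.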


Handling Errors in Kafka Connect

Robin Moffatt shows us some techniques for handling errors in your Kafka topics:

We’ve seen how setting errors.tolerance = all will enable Kafka Connect to just ignore bad messages. When it does, by default it won’t log the fact that messages are being dropped. If you do set errors.tolerance = all, make sure you’ve carefully thought through if and how you want to know about message failures that do occur. In practice that means monitoring/alerting based on available metrics, and/or logging the message failures.

The most simplistic approach to determining if messages are being dropped is to tally the number of messages on the source topic with those written to the output:

Read on for a few different tactics and how you can implement them.
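As a rough illustration of that counting idea (this is not the code from Robin’s post), here is a hedged sketch using the kafka-python package; the broker address and topic names are hypothetical:

```python
from kafka import KafkaConsumer, TopicPartition

# Hypothetical broker and topic names -- replace with your own.
BOOTSTRAP = "localhost:9092"
SOURCE_TOPIC = "source-topic"
OUTPUT_TOPIC = "output-topic"


def topic_message_count(consumer, topic):
    """Rough message count: sum of (end offset - beginning offset) across partitions."""
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    beginnings = consumer.beginning_offsets(partitions)
    ends = consumer.end_offsets(partitions)
    return sum(ends[tp] - beginnings[tp] for tp in partitions)


consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
source_count = topic_message_count(consumer, SOURCE_TOPIC)
output_count = topic_message_count(consumer, OUTPUT_TOPIC)
consumer.close()

print(f"source={source_count}, output={output_count}, difference={source_count - output_count}")
```

Offset ranges only approximate the message count when topics are compacted or segments have expired, so treat this as a smoke test rather than an exact audit.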


Unit Testing R Code

John Mount points out that you don’t need special infrastructure to perform unit testing in R:

There seems to be a general (false) impression among non R-core developers that to run tests, R package developers need a test management system such as RUnit or testthat. And a further false impression that testthat is the only R test management system. This is in fact not true, as R itself has a capable testing facility in “R CMD check” (a command triggering R checks from outside of any given integrated development environment).

By a combination of skimming the R-manuals (https://cran.r-project.org/manuals.html) and running a few experiments, I came up with a description of how R-testing actually works. And I have adapted the available tools to fit my current preferred workflow. This may not be your preferred workflow, but I have my reasons and give them below.

Food for thought for any R developer.


Investigating Azure Data Explorer

James Serra digs into how you can use Azure Data Explorer:

Azure Data Explorer (ADX) was announced as generally available on Feb 7th.  In short, ADX is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices.  ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has speeds of up to 200MB/sec per node (currently up to 3 nodes) and queries across a billion records take less than a second.  A typical use case is when you are generating terabytes of data from which you need to understand quickly what that data is telling you, as opposed to a traditional database that takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it.

It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform. It not only works on numbers but also does full-text search on semi-structured or unstructured data. One of my favorite demos was watching a query over 6 trillion log records, counting the number of critical errors by doing a full-text search for the word ‘alert’ in the event text; it took just 2.7 seconds. Because of this speed, ADX can be a replacement for search and log analytics engines such as Elasticsearch or Splunk. One way I heard it described that I liked was to think of it as an optimized cache on top of a data lake.

Click through for James’s explanation and where you might want to use ADX.
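If you want to poke at ADX from code, here is a minimal sketch using the azure-kusto-data Python client; the cluster URL, database, table, and column names are placeholders I made up, not anything from James’s demo:

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster and database names -- replace with your own.
cluster = "https://mycluster.westus.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# Count log records whose EventText mentions "alert" -- a full-text-style
# search similar in spirit to the demo described in the excerpt.
query = """
Logs
| where EventText has "alert"
| summarize CriticalAlerts = count()
"""

response = client.execute("TelemetryDb", query)
for row in response.primary_results[0]:
    print(row["CriticalAlerts"])
```

The KQL has operator does indexed, term-level matching, which is part of what makes this style of “count the alerts” query fast over large volumes of text.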
