A Quick Look At Data Visualization Tools

Vincent Wong walks us through data visualization tools on the market today:


When we talk about Echarts, we will usually compare it with Highcharts. The relationship between them is a bit like the relationship between WPS and Office.

Highcharts is also a visualization library which you have to pay for it if you are going to use it. It has many advantages, for example, its documents and tutorials, JS scripts, and CSS are very detailed. It saves time and allows you to pay more attention to learning and developing. What’s more, it is very stable.

There are some good tools on this list.

It’s All ETL (Or ELT) In The End

Robin Moffatt notes that ETL (and ELT) doesn’t go away in a streaming world:

In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept.

  • Extract: Data is available from a source system
  • Transform: We want to filter, cleanse or otherwise enrich this source data
  • Load: Make the data available to another application

There are two key concepts here:

  • Data is created by an application, and we want it to be available to other applications
  • We often want to process the data (for example, cleanse and apply business logic to it) before it is used

Thinking about many applications being built nowadays, particularly in the microservices and event-driven space, we recognize that what they do is take data from one or more systems, manipulate it and then pass it on to another application or system. For example, a fraud detection service will take data from merchant transactions, apply a fraud detection model and write the results to a store such as Elasticsearch for review by an expert. Can you spot the similarity to the above outline? Is this a microservice or ETL process?

Things like this are reason #1 why I expect data platform jobs (administrator and developer) to be around decades from now.  The set of tools expand, but the nature of the job remains similar.

Simplified Disaster Recovery With dbatools

Chrissy LeMaire shows how you can make DR a lot easier with dbatools:

When we talk about Disaster Recovery or DR, it’s often coupled with the term High Availability or HA. Here are some definitions from my graduate course on HADR.

high availability

  • Deals with minor outages, and failover solutions are automated
  • The goal is to restore full system functionality in a short time

disaster recovery

  • Deals with major outages such as natural and man-made disasters
  • Focuses on manual processes and procedures to restore systems back to their original state
  • Characterized by a phased approach to restoring the primary site

In the context of SQL Server, HA would be Availability Groups (AG), Failover Clustering (FCI), Log Shipping and more. I won’t be addressing High Availability in this post, however.

Chrissy has a demo of everything in action, including running a series of tests to ensure that your DR site actually has everything.

New(ish) VLF Status: 4

Paul Randal points out a new VLF status which can appear if you’re using an Availability Group:

At least since I started working on the SQL Server team (just after 7.0 shipped) and since then there have only been two VLF status codes:

  • 0 = the VLF is not active (i.e. it can be (re)activated and overwritten)
  • (1 = not used and no-one seems to remember what it used to mean)
  • 2 = the VLF is active because at least one log record in it is ‘required’ by SQL Server for some reason (e.g. hasn’t been backed up by a log backup or scanned by replication)

A few weeks ago I learned about a new VLF status code that was added back in SQL Server 2012 but hasn’t come to light until recently (at least I’ve never encountered it in the wild). I went back-and-forth with a friend from Microsoft (Sean Gallardy, a PFE and MCM down in Tampa) who was able to dig around in the code to figure out when it’s used.

Read on to uncover the mysteries of the VLF status of 4.

Dealing With Large JSON Values

Bert Wagner investigates an issue he found where his long JSON strings were becoming NULL in SQL Server:

After a little bit more research, I discovered that the return type for JSON_VALUE is limited to 4000 characters.   Since JSON_VALUE is in lax mode by default, if the output has more than 4000 characters, it fails silently.

To force an error in future code I could use SELECT JSON_VALUE(@json, ‘strict $.FiveThousandAs’)  so at least I would be notified immediately of an problem with my  query/data (via failure).

Although strict mode will notify me of issues sooner, it still doesn’t help me extract all of the data from my JSON property.

Read on for the answer.

Deleting Top Records With An Order By Clause

Kenneth Fisher shows that deleting the top N records with an ORDER BY clause is not straightforward:

Did you know you can’t do this?

FROM SalesOrderDetail

Msg 156, Level 15, State 1, Line 8
Incorrect syntax near the keyword ‘ORDER’.

I didn’t. Until I tried it anyway. Turns out, it says so right in the limitations section of BOL. Fortunately, that same section does have a recommendation on how to move forward.

Read on for a couple of methods to do this.

Interpreting The Area Under The Receiver Operating Characteristic Curve

Roos Colman explains what a Receiver Operating Characteristic (ROC) curve is and how we interpret the Area Under the Curve (AUC):

The AUC can be defined as “The probability that a randomly selected case will have a higher test result than a randomly selected control”. Let’s use this definition to calculate and visualize the estimated AUC.
In the figure below, the cases are presented on the left and the controls on the right.
Since we have only 12 patients, we can easily visualize all 32 possible combinations of one case and one control. (Rcode below)

Expanding from this easy-to-follow example, Colman walks us through some of the statistical tests involved.  Check it out.

Building A Neural Network In R With Keras

Pablo Casas walks us through Keras on R:

One of the key points in Deep Learning is to understand the dimensions of the vector, matrices and/or arrays that the model needs. I found that these are the types supported by Keras.

In Python’s words, it is the shape of the array.

To do a binary classification task, we are going to create a one-hot vector. It works the same way for more than 2 classes.

For instance:

  • The value 1 will be the vector [0,1]
  • The value 0 will be the vector [1,0]

Keras provides the to_categorical function to achieve this goal.

This example doesn’t include using CUDA, but the data sizes are small enough that it doesn’t matter much.  H/T R-Bloggers


Jennifer Lyons shows us a few places where we can use histomaps to good effect:

Histomaps can be a good option when we are looking to visualize qualitative trends over time. The trick is that you need to be using two mutually exclusive variables. In Spark’s case he used time and power. In the example below, I am using time and staff’s satisfaction with their work environment. Imagine you are collecting open-ended survey data every quarter during your grant term. You code a person’s response into the mutually exclusive category of met, partially met, and not met regarding their expressed satisfaction.

Read the whole thing.

T-SQL Tuesday Trigger Roundup

Steve Jones has a bunch of links about triggers this month:

This month was my turn to host T-SQL Tuesday. I chose Trigger Headaches or Happinessas the topic, and I am glad that there have been quite a few responses.

I started the review almost immediately,and here are a few highlights. I separated these based on how I first thought of them after reading the post. If you think I’ve mischaracterized one, let me know.

In case you don’t want to add, it’s 9 to 13, so triggers are a headache.

It’s a good roundup of cautionary stories, tips, and tricks.


September 2018
« Aug