Techniques for Determining Cluster Counts

Stephanie Glen has an image which describes different techniques for figuring out how many clusters you have:

If you want to determine the optimal number of clusters in your analysis, you’re faced with an overwhelming number of (mostly subjective) choices. Note that there’s no “best” method, no “correct” k, and there isn’t even a consensus as to the definition of what a “cluster” is. With that said, this picture focuses on three popular methods that should fit almost every need: Silhouette, Elbow, and Gap Statistic.

Click through for the picture and references.

Scalable Anomaly Detection with Kafka and Cassandra

Paul Brebner wraps up a series on anomaly detection at scale:

The complete machine for the biggest result (48 Cassandra nodes) has 574 cores in total.  This is a lot of cores! Managing the provisioning and monitoring of this sized system by hand would be an enormous effort. With the combination of the Instaclustr managed Cassandra and Kafka clusters (automated provisioning and monitoring), and the Kubernetes (AWS EKS) managed cluster for the application deployment it was straightforward to spin up clusters on demand, run the application for a few hours, and delete the resources when finished for significant cost savings. Monitoring over 100 Pods running the application using the Prometheus Kubernetes operator worked smoothly and gave enhanced visibility into the application and the necessary access to the benchmark metrics for tuning and reporting of results.

The system (irrespective of size) was delivering an approximately constant 400 anomaly checks per second per core.

This is a good summary of what was an interesting series.

Thoughts on SQL Notebooks

Emanuele Meazzo takes a look at the current state of SQL Notebooks in Azure Data Studio:

I’ve personally used SQL Notebook in my day-to-day work for Data Analysis, as the possibility to tweak the code and run it in the notebook greatly enhances the presentation of the data as oppose to a commented SQL Script ,as you cannot see all the query results in the same page too as opposed to a notebook; Moreover, a notebook (with or without results) can be exported in a read-only format like html or pdf to share the info with third parties, i.e. you can automate an analysis process that include code to be shared, cool stuff.

I think there are still a few (dozen) things to iron out before it’s a great experience, but they’re on the right path with it. If you haven’t checked out Azure Data Studio and its SQL Notebooks, give it a try sometime.

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.

Considerations Before Failing Over an AG

David Fowler points out a few things to look at before failing over an Availability Group:

What about your SQL Agent jobs? Have you got jobs that perform actions on your data? If you have, do those jobs exist on the new primary? If they don’t then I’m happy to bet that whatever function that they were playing probably isn’t happening anymore.

One thing that I always want to make sure before I failover is, do I have all the relevant jobs ready to roll on the secondary server?

But what about the jobs that you’ve got on the old primary? There’s a fair chance that, if they’re doing any sort of data manipulation, they’re going to be failing.

Click through for additional considerations.

Generating SELECT Statements with Biml

Kevin Feasel



Cathrine Wilhelmsen shows the easy way to build out a SELECT statement with Biml:

Biml column methods return code fragments. These code fragments can be used as building blocks to generate custom T-SQL statements. For example, the GetColumnList method returns a list of columns, separated by commas, that you can use in a SELECT statement. You can filter the columns and customize the output by passing parameters.

Biml is for more than just SSIS packages.

Blocked Queries and the Plan Cache

Erik Darling lays out an argument why execution time isn’t always a great measure for finding bad queries:

Most scripts (even ones I’ve worked on!) that look at the plan cache, have had the ability to sort it by different metrics: CPU, reads, writes, duration, etc.

A lot of people are very interested in long running queries — and I am too!

Heck, they’re how I make money. Blogging pays like crap.

But there’s a slight problem with only looking at query duration.

Read on for an example. Erik loses money on every blog post but he makes up for it in volume.

Azure Cloud Shell

Mark Broadbent gives us an introduction to Azure Cloud Shell:

There are two ways to access Azure Cloud Shell, the first being directly through the Azure Portal itself. Once authenticated, look to the top right of the Portal and you should see a grouping of icons and in particular, one that looks very much like a DOS prompt (have no fear, DOS is nowhere to be seen).

The second method to access Azure Cloud Shell is by jumping directly to it via which will require you to authenticate to your subscription before launching. There is an ever so slight difference between each method. Accessing the Shell via the Azure Portal will not require you to specify your Azure directory context (assuming you have several) since your Portal will have already defaulted to one, whereas with the direct URL method that obviously doesn’t happen.

Read the whole thing.

Data Layout in R with cdata

John Mount takes us through a few sample problems and how to reshape data with cdata:

This may seem like a lot of steps, but it is only because we are taking the problems very slowly. The important point is that we want to minimize additional problem solving when applying the cdata methodology. Usually when you need to transform data you are in the middle of some other more important task, so you want to delegate the details of how the layout transform is implemented. With cdata the user is not asked to perform additional puzzle solving to guess a sequence of operators that may implement the desired data layout transform. The cdata solution pattern is always the same, which can help in mastering it.

With cdata, record layout transforms are simple R objects with detailed print() methods- so they are convenient to alter, save, and re-use later. The record layout transform also documents the expected columns and constants of the incoming data.

Check it out.

Exploratory Analysis of Earthquake Data

Giorgio Garziano walks us through an earthquake data set:

Boxplots for each quantitative variables are shown. We take advantage of the quantitative variable names (quantitative_vars) determined before to apply a ggplot2 package based boxplot function. The Y axis labeling and title are determined by the variable to be plot. Further, legend is not displayed and we adopt the coordinate flip option for improved readability.

Check it out to get an idea of how to do exploratory data analysis.


April 2019
« Mar May »