Day: March 15, 2021

The Basics of k-Means Clustering

Nathaniel Schmucker explains some of the principles of k-means clustering:

k-Means is easy to implement. In R, you can use the function kmeans() to quickly deploy an efficient k-Means algorithm. On datasets of reasonable size (thousands of rows), the kmeans function runs in fractions of a second.

k-Means is easy to interpret (in 2 dimensions). If you have two features of your k-Means analysis (e.g., you are grouping by length and width), the result of the k-Means algorithm can be plotted on an xy-coordinate system to show the extent of each cluster. It’s easy to visually inspect the assignment to see if the k-Means analysis returned a meaningful insight. In more dimensions (e.g., length, width, and height) you will need to either create a 3D plot, summarize your features in a table, or find another alternative to describing your analysis. This loses the intuitive power that a 2D k-Means analysis has in convincing you or your audience that your analysis should be trusted. It’s not to say that your analysis is wrong; it simply takes more mental focus to understand what your analysis says.

The k-Means analysis, however, is not always the best choice. k-Means does well on data that naturally falls into spherical clusters. If your data has a different shape (linear, spiral, etc.), k-Means will force clustering into circles, which can result in outputs that defy human expectations. The algorithm is not wrong; we have fed the algorithm data it was never intended to understand.
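To make the assign-and-average loop behind k-Means concrete, here is a minimal sketch in Python rather than the R kmeans() function the excerpt mentions. It is plain Lloyd's algorithm, not the implementation R uses; the function name, parameters, and sample points are assumptions chosen for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on tuples of coordinates; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[c])  # leave an empty cluster where it is
        if updated == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = updated
    return centroids, labels


# Hypothetical data: two compact, well-separated blobs (length, width).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

Because every point is assigned by straight-line distance to a centroid, the method favors round, compact groups, which is why the spiral and linear shapes described above come out badly.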

There’s a lot of depth in this article, which makes it really interesting.

Comments closed

The Basics of Graph Theory

Ernest Martinez gives us a primer on graph theory, as well as a few interesting use cases:

We used a new option in the Oracle database called Spatial Data Option, which allowed us to do multi-dimensional queries based on geographic location and perform shortest path queries in SQL. To access the latitude and longitude of every zip code in the country, we purchased a list of zip code centroids from the US Post Office. We then joined the centroid zip with a store’s zip which gave us an approximate cartesian coordinate for the store.

The first customer to purchase this product was a national muffler company. We POC’d (proof of concept) it initially in the NYC area. The first problem we encountered was that the shortest distance between point A and point B wasn’t necessarily the right answer. For example, to a person living on the north shore of Long Island the nearest shop, as the crow flies, was in Connecticut, across the Long Island Sound. Unless they had a boat, this was definitely not their closest shop. Obviously we needed to introduce cost functions into our algorithms. A high cost across the sound resolved the issue.
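The cost-function idea can be sketched with Dijkstra's algorithm over a weighted graph. This is an illustration in Python, not the Oracle Spatial query the author used; the node names, mileages, and the size of the water penalty are all hypothetical.

```python
import heapq


def cheapest_routes(graph, start):
    """Dijkstra's algorithm: lowest total cost from start to each reachable node."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, edge_cost in graph.get(node, {}).items():
            candidate = cost + edge_cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


def nearest_shop(graph, customer, shops):
    """Pick the reachable shop with the lowest total cost, not the lowest distance."""
    costs = cheapest_routes(graph, customer)
    return min((s for s in shops if s in costs), key=costs.get)


# Hypothetical costs: straight-line miles plus a heavy penalty for crossing water.
WATER_PENALTY = 1000.0
graph = {
    "customer": {
        "connecticut_shop": 12.0 + WATER_PENALTY,  # closest as the crow flies
        "long_island_shop": 20.0,
    },
}
print(nearest_shop(graph, "customer", ["connecticut_shop", "long_island_shop"]))
```

Without the penalty the Connecticut shop wins on raw distance; with it, the Long Island shop becomes the cheapest choice.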

Click through for more info and a few stories.

Adding Images to Excel using PowerShell

Mikey Bronowski continues a series on working with PowerShell:

This is part of the How to Excel with PowerShell series. Links to all the tips can be found in this post.
If you would like to learn more about the module with an interactive notebook, check this post out.

Spreadsheets’ main purpose is data: storing, manipulating, and analyzing it. We can add some colours or charts to make the data more friendly, but sometimes we may want to add something else – like a logo or picture. All of that can be achieved with PowerShell.

Read on to see how you can lay out an image or add shapes to a spreadsheet.

Deploying Azure Data Services via Terraform

Chris Adkin has started a series on deploying Azure Arc enabled Data Services. Part 1 serves as an introduction:

One of the most significant things to change the landscape for Azure data professionals will be the general release of Azure Arc enabled Data Services. To provide an expedient means of experiencing all that Azure Arc has to offer, Microsoft has come up with Jumpstart – a collection of GitHub repos for deploying Arc in different scenarios. Last Christmas I had a few vacation days and took the opportunity to try out Jumpstart for Azure Arc enabled data services on AWS. AWS was my choice because it made a certain amount of sense to try out Azure Managed SQL Server instances and Postgres Hyperscale on a cloud that they are not natively available on. After all, the whole point of Azure Arc enabled Data Services is to bring Azure to you on your terms if for any reason you cannot use the Azure cloud.

Part 2 gives us an introduction to Terraform:

Before diving into what the various Terraform modules do that make up the Arc-PX-VMware-Faststart repo, I’m going to provide an introduction to Terraform in this blog post. Terraform comes from Hashicorp; it is a tool that works on the principle of infrastructure-as-code. Resources are specified in what are called configuration files using Hashicorp Configuration Language in a declarative manner, i.e. you state what you want and, to the best of its ability, Terraform attempts to create those resources for you. ‘Providers’ are used to create resources for particular types of entity; for example, you might use the local file, helm (the Kubernetes package manager), Azure, or VMware providers. Using providers requires plugins, most of which are provided by Hashicorp, but third parties can also write their own plugins.
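The declarative idea (state what you want, and let the tool work out the steps) can be illustrated with a small Python sketch that compares a desired state against what currently exists. This is a conceptual model only, not how Terraform is implemented; the function name and the resource names and settings are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual resources (name -> settings) and list actions."""
    actions = []
    for name, settings in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != settings:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources: what the configuration asks for vs. what already exists.
desired = {
    "vm_worker_1": {"cpus": 4},
    "vm_worker_2": {"cpus": 4},
}
actual = {
    "vm_worker_1": {"cpus": 2},
    "vm_old": {"cpus": 1},
}
for action, name in plan(desired, actual):
    print(action, name)
```

The configuration never says "add two CPUs" or "remove the old VM"; those steps fall out of the comparison, which is the appeal of the declarative approach.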

Check out the first two posts in what promises to be an interesting series.


Performance Gains with APPLY

Erik Darling gives us a scenario where OUTER APPLY is quite useful:

In all, this query will run for about 18 seconds. The majority of it is spent in a bad neighborhood.

Why does this suck? Boy oh boy. Where do we start?

– Sorting the Votes table to support a Merge Join?
– Choosing Parallel Merge Joins ever?
– Choosing a Many To Many Merge Join ever?
– All of the above?

I voted “all of the above.” Click through to see how Erik turns a bad query plan into a much less bad query plan.
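For readers new to the operator, OUTER APPLY logically evaluates its right-hand expression once per row on the left and keeps left rows that get no match. The Python sketch below models only that row-by-row behavior; it is not Erik's query, and the table contents and helper names are hypothetical.

```python
def outer_apply(left_rows, right_for):
    """Pair each left row with the rows right_for(row) returns; keep unmatched rows."""
    result = []
    for left in left_rows:
        matches = right_for(left)
        if matches:
            result.extend((left, right) for right in matches)
        else:
            result.append((left, None))  # like the NULLs OUTER APPLY produces
    return result


# Hypothetical data standing in for two tables.
posts = [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]
votes = [
    {"post_id": 1, "created": "2021-03-01"},
    {"post_id": 1, "created": "2021-03-10"},
]


def latest_vote(post):
    # Plays the role of a correlated TOP (1) ... ORDER BY created DESC subquery.
    matching = [v for v in votes if v["post_id"] == post["id"]]
    return sorted(matching, key=lambda v: v["created"], reverse=True)[:1]


rows = outer_apply(posts, latest_vote)
for post, vote in rows:
    print(post["title"], vote["created"] if vote else None)
```

A CROSS APPLY would behave the same way except that the unmatched second post would be dropped.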

Bad Request when Debugging an Azure Data Factory Pipeline

Ed Elliott ran into a problem:

Now, whenever I am troubleshooting something in Azure and I come to the activity logs I am always hopeful but also always disappointed that they don’t show more details. The bit that really annoys me is that I know Microsoft see more detailed error information as I have been screen sharing with a support tech who used log explorer to see more detailed error messages than I see – grrrr, just show us the data! Anyway, I digress – so in the activity log, does it give a clue as to what is wrong?

No, in a word no it doesn’t. 

Read on for the conclusion, which rates as “Should have been an easy fix but the error message was completely unhelpful.”

The Importance of LSNs to SQL Server

Jack Vamvas explains a concept:

I was talking to an Auditor recently – who specialises in large Corporate Audits – and they asked me how I would prove that a certain database which is backed up is actually restored to another server. One of the methods I described was using the Log Sequence Numbers (LSN).

Read on for an explanation of how they work and how you can use LSNs to solve that auditing issue.

Keeping msdb Clean

Eitan Blumin takes the data janitor role seriously:

As part of its regular, ongoing, day-to-day activities, your SQL Server instance would naturally collect historical data about its automated operations. If left unchecked, this historical data could pile up, leading to wasted storage space, performance hits, and even worse issues.

MSDB would obviously be collecting data about the SQL Agent job executions. But there are also a few other types of historical data that need to be cleaned up once in a while. In this blog post, I hope to cover all bases and leave no historical data un-cleaned.

Read on for several data sources which you’ll want to keep tidy.
