Curated SQL Posts

Using the tree Command

Denis Gobo learns a new trick:

I was watching a Pluralsight course and the person typed in the tree command… and I was like whoaaaa… How do I not know this? Perhaps because I don’t use the command window all that much? Anyway, I thought that this was pretty cool

As you can see, tree lists all the directories and subdirectories in a tree-like structure. This is great for quickly seeing all the directories in one shot.

It’s a useful command, and if you’re on Linux, there are a lot of useful switches; on Windows, there are considerably fewer.
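
For reference, a small sampling of each (switch names are from the versions I’ve used; check tree --help on Linux or tree /? on Windows):

    # Linux (the tree package)
    tree -d -L 2        # directories only, limited to two levels deep
    tree -a --du -h     # include hidden files, with human-readable directory sizes

    # Windows (tree.com has just the two switches)
    tree /F             # list files as well as directories
    tree /A             # draw with ASCII characters instead of line-drawing ones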

Comments closed

Deploying Azure Data Services via Terraform

Chris Adkin has two additional parts of a series. Part 3 shows us how to deploy a virtual machine on VMware:

To do this you require an Ubuntu virtual machine; I’ve tested this with Ubuntu 18.04 LTS and will get around to testing it with Ubuntu 20.10 at some stage. If, for example, the virtual machine was created with a user called azuser, the deployment server should also have an azuser account under which all Terraform commands are executed.

Part 4 takes those VMs and sets up a Kubernetes cluster across them:

Whatever you do when deploying a Kubernetes cluster, somewhere along the line you have to use kubeadm. There is a wealth of material available in blog posts and on the internet in general in which people roll their own scripts using kubeadm. I often suspect that many of these efforts are the result of Kelsey Hightower’s Kubernetes the Hard Way. In this post we are emphatically going to do things the easy way, […]

And now we’re caught up on the series…for the moment, at least.
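
For context, the kubeadm core that those roll-your-own scripts wrap boils down to a couple of commands. A sketch (the pod network CIDR is the common Flannel default; the endpoint, token, and hash are placeholders that kubeadm init prints for you):

    # on the control plane node
    sudo kubeadm init --pod-network-cidr=10.244.0.0/16

    # on each worker node, using the join command kubeadm init printed
    sudo kubeadm join 10.0.0.10:6443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>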

Comments closed

Creating Parquet Files from SQL Server Data

Andy Leonard answers a challenge:

I searched and found some promising Parquet SSIS components available from CData Software and passed that information along. I shared my inexperience in exporting to Parquet format and asked a few friends how they’d done it.

I thought: How many times have I demonstrated Azure Data Factory and clicked right past file format selection without giving Parquet a second thought? Too many times. It was time to change that.

Another route is to use PolyBase. If you’re okay with writing the results to Azure Blob Storage, you can insert the results of a SQL query directly into Parquet files. If that sounds interesting, here are posts on connecting to Azure Blob Storage via PolyBase and inserting into Azure Blob Storage. I insert in CSV format to make it easier for people to follow, but swap the file format to Parquet and it works all the same.
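
A rough sketch of that export pattern (the object names and location here are made up, and an external data source with its credential is assumed to already exist):

    -- one-time setup: PolyBase export must be enabled
    EXEC sp_configure 'allow polybase export', 1;
    RECONFIGURE;

    -- Parquet in place of the DELIMITEDTEXT format from those posts
    CREATE EXTERNAL FILE FORMAT ParquetFileFormat
    WITH (FORMAT_TYPE = PARQUET);

    -- LOCATION is a folder within the blob storage container
    CREATE EXTERNAL TABLE dbo.SalesExport
    (
        SaleID INT NOT NULL,
        SaleAmount DECIMAL(12, 2) NOT NULL
    )
    WITH
    (
        LOCATION = '/exports/sales/',
        DATA_SOURCE = AzureBlobStorage,    -- assumed to exist already
        FILE_FORMAT = ParquetFileFormat
    );

    -- inserting writes Parquet files into that folder
    INSERT INTO dbo.SalesExport
    SELECT SaleID, SaleAmount
    FROM dbo.Sales;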

Comments closed

T-SQL Tuesday 136 Wrap-Up

Brent Ozar rounds up the usual suspects, plus several more:

For this month’s T-SQL Tuesday, I asked you to blog about your most-loved and least-loved data types.

Crazy, right? How could people possibly love or hate data types? Well, if you’ve been working with them for a while, I figured you’d have built up an array of tips or pain points, and y’all delivered with 29 interesting blog posts.

Click through for a lengthy list of interesting posts.

Comments closed

The Basics of k-Means Clustering

Nathaniel Schmucker explains some of the principles of k-means clustering:

k-Means is easy to implement. In R, you can use the function kmeans() to quickly deploy an efficient k-Means algorithm. On datasets of reasonable size (thousands of rows), the kmeans function runs in fractions of a second.

k-Means is easy to interpret (in 2 dimensions). If you have two features in your k-Means analysis (e.g., you are grouping by length and width), the result of the k-Means algorithm can be plotted on an xy-coordinate system to show the extent of each cluster. It’s easy to visually inspect the assignment to see if the k-Means analysis returned a meaningful insight. In more dimensions (e.g., length, width, and height) you will need to either create a 3D plot, summarize your features in a table, or find some other way of describing your analysis. This loses the intuitive power that a 2D k-Means analysis has in convincing you or your audience that your analysis should be trusted. That’s not to say that your analysis is wrong; it simply takes more mental focus to understand what your analysis says.

The k-Means analysis, however, is not always the best choice. k-Means does well on data that naturally falls into spherical clusters. If your data has a different shape (linear, spiral, etc.), k-Means will force clustering into circles, which can result in outputs that defy human expectations. The algorithm is not wrong; we have fed the algorithm data it was never intended to understand.

There’s a lot of depth in this article, which makes it really interesting.
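
The two-feature case Nathaniel describes takes only a few lines in R (using the built-in iris data as a stand-in):

    # cluster on two features so the result can be inspected in 2D
    set.seed(42)   # kmeans starts from random centers
    km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 25)

    plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
         xlab = "Petal length", ylab = "Petal width")
    points(km$centers, pch = 8, cex = 2)   # mark the cluster centers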

Comments closed

The Basics of Graph Theory

Ernest Martinez gives us a primer on graph theory, as well as a few interesting use cases:

We used a new option in the Oracle database called the Spatial Data Option, which allowed us to do multi-dimensional queries based on geographic location and perform shortest-path queries in SQL. To access the latitude and longitude of every zip code in the country, we purchased a list of zip code centroids from the US Post Office. We then joined the centroid zip with a store’s zip, which gave us an approximate Cartesian coordinate for the store.

The first customer to purchase this product was a national muffler company. We POC’d (proof of concept) it initially in the NYC area. The first problem we encountered was that the shortest distance between point A and point B wasn’t necessarily the right answer. For example, to a person living on the north shore of Long Island the nearest shop, as the crow flies, was in Connecticut, across the Long Island Sound. Unless they had a boat, this was definitely not their closest shop. Obviously we needed to introduce cost functions into our algorithms. A high cost across the sound resolved the issue.

Click through for more info and a few stories.
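
That muffler-shop story is really about weighting edges by travel cost rather than by straight-line distance. A toy version in R with igraph (all numbers invented):

    library(igraph)

    # as the crow flies, the Connecticut shop looks closer
    g <- graph_from_data_frame(data.frame(
      from   = c("Home", "Home"),
      to     = c("ShopCT", "ShopNY"),
      weight = c(12, 25)       # straight-line miles
    ), directed = FALSE)
    distances(g, v = "Home")   # ShopCT: 12, ShopNY: 25

    # price in the water crossing and the nearest shop flips
    E(g)["Home" %--% "ShopCT"]$weight <- 12 + 100
    distances(g, v = "Home")   # ShopCT: 112, ShopNY: 25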

Comments closed

Adding Images to Excel using Powershell

Mikey Bronowski continues a series on working with Powershell:

This is part of the How to Excel with PowerShell series. Links to all the tips can be found in this post.
If you would like to learn more about the module with an interactive notebook, check this post out.

Spreadsheets’ main purpose is data: storing, manipulating, and analyzing them. We can add some colours or charts to make the data more friendly, but sometimes we may want to add something else, like a logo or a picture, and all of that can be achieved with PowerShell.

Read on to see how you can lay out an image or add shapes to a spreadsheet.
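
For a taste of what that looks like, here’s a minimal sketch with the ImportExcel module (which surfaces EPPlus’s drawing API; the paths and names below are made up):

    Import-Module ImportExcel

    $excel = Open-ExcelPackage -Path 'C:\temp\Report.xlsx'
    $sheet = $excel.Workbook.Worksheets['Sheet1']

    # AddPicture comes from the underlying EPPlus library
    $logo = $sheet.Drawings.AddPicture('CompanyLogo', (Get-Item 'C:\temp\logo.png'))
    $logo.SetPosition(0, 5, 0, 5)   # row, row offset (px), column, column offset (px)
    $logo.SetSize(120, 60)          # width and height in pixels

    Close-ExcelPackage $excel       # saves and releases the file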

Comments closed

Deploying Azure Data Services via Terraform

Chris Adkin has started a series on deploying Azure Arc enabled Data Services. Part 1 serves as an introduction:

One of the most significant things to change the landscape for Azure data professionals will be the general release of Azure Arc enabled Data Services. To provide an expedient means of experiencing all that Azure Arc has to offer, Microsoft has come up with Jumpstart – a collection of GitHub repos for deploying Arc in different scenarios. Last Christmas I had a few vacation days and took the opportunity to try out Jumpstart for Azure Arc enabled data services on AWS. AWS was my choice because it made a certain amount of sense to try out Azure Managed SQL Server instances and Postgres Hyperscale on a cloud where they are not natively available. After all, the whole point of Azure Arc enabled Data Services is to bring Azure to you on your terms if for any reason you cannot use the Azure cloud.

Part 2 gives us an introduction to Terraform:

Before diving into what the various Terraform modules that make up the Arc-PX-VMware-Faststart repo do, I’m going to provide an introduction to Terraform in this blog post. Terraform comes from HashiCorp; it is a tool that works on the principle of infrastructure-as-code. Resources are specified in what are called configuration files using the HashiCorp Configuration Language in a declarative manner, i.e., you state what you want and, to the best of its ability, Terraform attempts to create those resources for you. ‘Providers’ are used to create resources for particular types of entity; for example, you might use the local file, Helm (the Kubernetes package manager), Azure, or VMware providers. Using providers requires plugins, most of which are provided by HashiCorp, but third parties can write their own plugins as well.

Check out the first two posts in what promises to be an interesting series.
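
To make the declarative idea concrete, here is about the smallest configuration that does anything (a toy of my own, not from the Arc-PX-VMware-Faststart repo):

    terraform {
      required_providers {
        local = {
          source = "hashicorp/local"
        }
      }
    }

    # state what you want; Terraform works out how to get there
    resource "local_file" "hello" {
      filename = "${path.module}/hello.txt"
      content  = "created declaratively by Terraform"
    }

Running terraform init downloads the provider plugin, terraform plan shows what would change, and terraform apply creates the file.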

2 Comments

Performance Gains with APPLY

Erik Darling gives us a scenario where OUTER APPLY is quite useful:

In all, this query will run for about 18 seconds. The majority of it is spent in a bad neighborhood.

Why does this suck? Boy oh boy. Where do we start?

– Sorting the Votes table to support a Merge Join?
– Choosing Parallel Merge Joins ever?
– Choosing a Many To Many Merge Join ever?
– All of the above?

I voted “all of the above.” Click through to see how Erik turns a bad query plan into a much less bad query plan.
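
The general shape of the rewrite looks something like this (a sketch against the StackOverflow demo database Erik typically uses; his actual query is in the post):

    -- the correlated OUTER APPLY aggregates Votes one user at a time,
    -- which sidesteps sorting the whole table to feed a merge join
    SELECT u.DisplayName, va.VoteCount
    FROM dbo.Users AS u
    OUTER APPLY
    (
        SELECT COUNT_BIG(*) AS VoteCount
        FROM dbo.Votes AS v
        WHERE v.UserId = u.Id
    ) AS va;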

Comments closed

Bad Request when Debugging an Azure Data Factory Pipeline

Ed Elliott ran into a problem:

Now, whenever I am troubleshooting something in Azure and I come to the activity logs, I am always hopeful but also always disappointed that they don’t show more details. The bit that really annoys me is that I know Microsoft see more detailed error information, as I have been screen sharing with a support tech who used log explorer to see more detailed error messages than I see – grrrr, just show us the data! Anyway, I digress – so in the activity log, does it give a clue as to what is wrong?

No, in a word, no it doesn’t.

Read on for the conclusion, which rates as “Should have been an easy fix but the error message was completely unhelpful.”

Comments closed