2022-06-15 – Curated SQL

An Overview of Clustering Algorithms

Published 2022-06-15 by Kevin Feasel

Gavita Regunath has a two-parter on clustering. First, an explanation of the concept:

Clustering, or cluster analysis, is an unsupervised machine learning method. As the name implies, unsupervised machine learning refers to how the model ‘learns’ the data. It is a learning process opposite to supervised learning. With supervised learning, models are trained or “supervised” using labelled datasets (a known function output to our data). An example of a supervised learning method is where a model is trained to recognise animals based on their labels of being a cat, dog and rabbit.
Unsupervised learning works with unlabelled data where there are no known function outputs, and the aim is to identify patterns within a dataset. There are many unsupervised learning algorithms, however, the three main types are clustering algorithms, dimensionality reduction and anomaly detection. The focus of this blog will be on clustering, as it is the most commonly used unsupervised learning technique.

Second, a review of ten clustering algorithms:

There are many clustering algorithms. In fact, there are more than 100 clustering algorithms that have been published so far. However, despite the various types of clustering algorithms, they can generally be categorised into four methods. Let’s look at these briefly:

Read on to learn more about clustering.

Comments closed

Generating an Expression Variable for Joins with PySpark

Published 2022-06-15 by Kevin Feasel

Unmesha Sreeveni uses a variable to effect a join in PySpark:

Lets see how to join 2 table with a parameterized on condition in PySpark
Eg: I have 2 dataframes A and B and I want to join them with id,inv_no,item and subitem

Click through to see how. It turns out to be pretty straightforward.

Comments closed

Pull Request Standards

Published 2022-06-15 by Kevin Feasel

Deb Melkin stands in front of the gate:

When I think of what my coding standards are, I tend to think of it as how would I review scripts for a pull request (PR). I think my past coworkers can attest that I can get quite picky when I look at code. So if I were your reviewer, what sort of things would I comment on?

Click through for a checklist.

Comments closed

Tracking Table Updates via SQL Audit

Published 2022-06-15 by Kevin Feasel

Tracy Boggiano wants to figure out who keeps taking her lunch out of the company refrigerator:

I had a problem at work recently where a record was getting updated, and no one knew where or what was updating the record. Our team discussed the best way to try to figure out what was happening. The situation was if a record would be updated to active and within a ten-minute window, the record would be set back to inactive. The system allows ad-hoc statements to run against and since it was to only a certain table, I suggested we set up a SQL Audit to track UPDATEs to the table. The code for this is fairly simple, but since most of my colleagues don’t have exposure to SQL Audit, I figured a blog post would benefit others.
So, in this case, we are creating a Server SQL Audit that will write to D:\SQL Audit, so make sure that path exists. Then a Database Server Audit Specification to track any UPDATEs that happen to the table. Now, keep in mind I choose the method over running a server-side Trace or Extended Events because I knew it would capture everything without me having to worry about setting up anything else put these commands. An important part of this is where I specify “public”. That tells the audit to capture anybody that is updating the table. If you want to look for a certain user or even maybe someone part of a role, you could specify that instead.

Click through for the auditing script. I wish this type of information were a lot easier to get, especially for longer-term audits. I end up creating metadata columns (created/modified user, created/modified date) but that gives limited information and requires all calling code play along.

Comments closed

Enumerating Azure Storage Replication Types

Published 2022-06-15 by Kevin Feasel

Arun Sirpal has a list:

Storage Accounts are pretty much integrated into so many different designs in Azure, whether you are using Azure Synapse, 3^rd party product like Snowflake, or Event Streaming designs – we need it.
When you create a storage account there are 5 different replication types you should know about. These are LRS, ZRS, GRS, RA-GRS and GZRS. Lots of abbreviations here, lets explain further.

Read on for the explanation.

Comments closed

Finding Assigned Synapse RBAC Roles in Powershell

Published 2022-06-15 by Kevin Feasel

Charith Caldera wants to know your access level:

One of the key use cases that most customers face difficulties while retrieving or assigning the role-based access control in Azure Synapse Analytics, that they cannot find the correct usernames, group names or the service principal names using the PowerShell cmdlet “Get-AzSynapseRoleAssignment”. The PowerShell cmdlet only provides limited information and it’s difficult to understand since that contains the object IDs.

Read on for a script which helps with this problem.

Comments closed

Using DATE_BUCKET() in SQL Server

Published 2022-06-15 by Kevin Feasel

Hasan Savran starts bucketing:

The Date_Bucket function is introduced in Azure SQL Edge which is mainly used by IoT devices. This useful function returns the date-time value corresponding to the start of each date-time bucket from the timestamp defined by the origin parameter, or the default origin value of 1900-01-01 00:00:00.000. In other words, it lets you arrange data into groups that represent fixed intervals of time. SQL Server 2022 includes this useful function in its database engine.
Results of Data_Bucket might be confusing, Let’s look at its syntax first.

Hasan is quite right here: the results of DATE_BUCKET() are not intuitive, though they do make some sense…eventually…

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Day: June 15, 2022

An Overview of Clustering Algorithms

Generating an Expression Variable for Joins with PySpark

Pull Request Standards

Tracking Table Updates via SQL Audit

Enumerating Azure Storage Replication Types

Finding Assigned Synapse RBAC Roles in Powershell

Using DATE_BUCKET() in SQL Server