
Author: Kevin Feasel

K-Means and K-Medoids Clustering

Niti Sharma explains two clustering algorithms:

K-means and k-medoids are partitional clustering methods that work by specifying an initial number of groups and then iteratively reallocating objects among those groups.

The algorithm works by first segregating all the points into an already selected number of clusters. The process is carried out by measuring the distance between each point and the center of each cluster. Because k-means can function only in Euclidean space, the algorithm's applicability is limited. Despite the shortcomings the algorithm possesses, k-means is still one of the most powerful tools used in clustering, and it is widely applied in multiple fields: the physical sciences, natural language processing (NLP), and healthcare.

k-means is a fairly common algorithm, but you hear less about k-medoids—it’s the more robust alternative to k-means.
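
For a quick look at the difference between the two, here is a minimal sketch in R using the built-in kmeans() function and pam() from the cluster package. Neither function is named in the article, and the iris data and parameter choices are purely illustrative.

# Minimal sketch: k-means vs. k-medoids on the iris measurements.
# kmeans() is base R; pam() ("partitioning around medoids") is a
# k-medoids implementation from the cluster package.
library(cluster)

x <- iris[, 1:4]   # numeric features only
set.seed(42)

km <- kmeans(x, centers = 3, nstart = 25)   # cluster centers are means (Euclidean only)
pm <- pam(x, k = 3, metric = "manhattan")   # medoids are actual data points; other distances work

# Compare cluster assignments against the known species
table(km$cluster, iris$Species)
table(pm$clustering, iris$Species)

The practical difference shows up in the centers: k-means averages points, so it needs Euclidean-style geometry and is sensitive to outliers, while k-medoids picks a representative observation, which is what makes it the more robust choice.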

Comments closed

Reporting on Correlation Analysis in R

Petr Baranovskiy continues a series on correlation analysis using R:

This is the second part of the Correlation Analysis in R series. In this post, I will provide an overview of some of the packages and functions used to perform correlation analysis in R, and will then address reporting and visualizing correlations as text, tables, and correlation matrices in online and print publications.

Read the whole thing.
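
Even base R produces the raw numbers you would then format for publication. As a minimal point of reference (base functions only, not necessarily the packages Petr covers in the series):

# Correlation matrix rounded for a print-friendly table
data(mtcars)
round(cor(mtcars[, c("mpg", "hp", "wt", "disp")]), 2)

# A single pairwise test with the estimate, confidence interval,
# and p-value you would report in text
cor.test(mtcars$mpg, mtcars$wt, method = "pearson")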

Comments closed

The Production-Readiness of Azure Synapse Analytics

Paul Andrew casts some harsh light:

While I completely share and actually like Microsoft’s vision of an analytics resource…

“that brings together data integration, enterprise data warehousing and big data analytics”

https://azure.microsoft.com/en-gb/services/synapse-analytics/

… the marketing, hype and technical implementation have resulted in a lot of confusion and disappointment.

So, to answer the title of this blog post directly. My opinion, as I write on 29th January 2021, is: No. Azure Synapse Analytics is not ready. Sorry Microsoft, but you’ve had long enough. I can’t hold back the questions and demands from customers anymore on why Synapse still isn’t included in my architecture diagrams.

Paul raises many good points, and the positive takeaway is that these are fixable issues. But as of today, they are definitely things you want to consider before jumping in.

Comments closed

Changing IP Addresses in an Availability Group

Sreekanth Bandarla is ready to make a change:

In this blog post, let’s see how to change all the IP addresses involved in a typical Always On Availability Group configuration. In my setup, I have an AG with two replicas and a listener. See below to get an idea of my current environment, on which I am going to change all the underlying IP addresses.

Click through for a step-by-step process, as well as a few things to remember.

Comments closed

Combining Azure Synapse Analytics and Azure Purview

Wolfgang Strasser shows how we can integrate Azure Synapse Analytics with Azure Purview:

In the past months I had the chance to play with and build solutions based on Azure Synapse Analytics and Azure Purview.

Azure Synapse (my Synapse blog entries) serves as the foundation for a solid platform to store, analyze, and build data solutions, and Azure Purview (my Purview blog posts) serves as the data governance and data catalog solution in Azure.

During the writing of my latest blog post (What’s new in Azure Synapse Analytics?), I found a very interesting entry in the update feature list: Azure Purview Integration.

Read on to see how.

Comments closed

Power Query Folding Indicators

Matthew Roche points out a nice addition to Power Query:

Because of the performance benefit that query folding provides, experienced query authors are typically very careful to ensure that their queries take advantage of the capabilities of their data sources, and that they fold as many operations as possible. But for less experienced query authors, telling which steps will fold and which will not has not always been simple…

Until now.

Read on for more information. I saw this for the first time in a recent presentation and was pleasantly surprised at how well it works.

Comments closed

Importing Graph Data into SQL Server

Louis Davidson takes us through an interesting problem:

The problem was, if I wanted to recreate this graph in data, I had to type in a bunch of SQL statements (something I generally enjoy to a certain point, but one of my sample files covers the geography of Disney World, and it would take a very long time to manually type that into a database, as it took quite a while just to do one section of the park).

So I went hunting for a tool to do this for me, but ended right back with yEd. The default file type when you save in yEd is GraphML, which is basically some pretty complex XML that was well beyond my capabilities using XML in SQL or PowerShell. Realistically I don’t care that much about anything other than just the nodes and edges, and what I found was that you can save graphs in the tool in a format named Trivial Graph Format (TGF).

Click through to see it in action.
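
To give a sense of how simple TGF is, here is a rough sketch in R that splits a file into node and edge tables. The file name and contents are hypothetical, and Louis’s post loads the parsed rows into SQL Server graph tables rather than R data frames.

# TGF is line-oriented: "id label" lines for nodes, a "#" separator,
# then "from to label" lines for edges.
lines <- readLines("disney.tgf")   # hypothetical file
sep   <- which(lines == "#")

node_lines <- lines[seq_len(sep - 1)]
edge_lines <- lines[(sep + 1):length(lines)]

nodes <- data.frame(
  id    = sub(" .*", "", node_lines),
  label = sub("^\\S+ ?", "", node_lines)
)

edges <- data.frame(
  from_id = sub(" .*", "", edge_lines),
  to_id   = sapply(strsplit(edge_lines, " "), `[`, 2),
  label   = sub("^\\S+ \\S+ ?", "", edge_lines)
)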

Comments closed

Model Post-Processing with insight

The easystats team talks about the insight package in R:

We are talking about the insight package. It is what allows other packages, like easystats (parameters, effectsize, performance, report, …) or ggstatsplot, sjstats or modelsummary to be as powerful as they are, supporting tons of different R models. So why make your life hard when you can be like them, and rely on insight?

It is made for developers (and users) that do some postprocessing of different models (e.g., extracting stuff like parameters, values, data, names, specifications, predictions, priors, etc.), whether it is to nicely display their results or to do further computation.

Click through for an example of what it does and how it works. H/T R-bloggers
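
A small sketch of the sort of post-processing insight standardizes, using an ordinary lm model; the accessor functions shown here (get_parameters(), get_data(), and friends) are insight exports, but the example itself is mine rather than from the linked post.

library(insight)

m <- lm(mpg ~ wt + hp, data = mtcars)

find_response(m)          # name of the response variable: "mpg"
find_predictors(m)        # predictor terms, grouped by model component
get_parameters(m)         # data frame of parameter names and estimates
head(get_data(m))         # the data used to fit the model
model_info(m)$is_linear   # TRUE for this model

# The same calls work unchanged for many other model classes
# (glm, mixed models, Bayesian models, ...), which is the point.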

Comments closed

Determining a Good Test Set Size

John Mount thinks about test set size:

In this note we will answer “what is a good test set size?” three ways.

– The usual practical answer.
– A decision theory answer.
– A novel variational answer.

Each of these answers is a bit different, as they are solved in slightly different assumed contexts and optimizing different objectives. Knowing all 3 solutions gives us some perspective on the problem.

My rule of thumb is that I want it to be as small as possible while containing the highest likelihood of hitting all real-world scenarios enough times to provide a valid comparison. This conversely maximizes the size of the training data set, giving us the best chance of seeing the widest variety of scenarios we can during the formative phase.

And as usual, John goes way deeper than my rules of thumb. I like this post a lot.
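
As a back-of-the-envelope illustration of my rule of thumb (not John’s derivations), here is a sketch in R that finds the smallest test set where a rare scenario, occurring at an assumed rate, is still likely to show up enough times to support a valid comparison.

p <- 0.02   # assumed rate of the rarest scenario we care about
k <- 10     # minimum number of times we want to see it in the test set

# Smallest n such that P(Binomial(n, p) >= k) >= 0.95
n <- k
while (pbinom(k - 1, size = n, prob = p, lower.tail = FALSE) < 0.95) {
  n <- n + 1
}
n   # rows to hold out; everything beyond this can stay in training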

Comments closed