Month: October 2023

Visualizing Kusto Graphs with Plotly and Python

Henning Rauch creates some plots:

Graphs are a powerful way to model and analyse complex relationships between entities, such as cybersecurity incidents, network traffic, social networks, and more. Kusto, the query and analytics engine of Azure Data Explorer, Microsoft Fabric Real-Time Analytics, and many more, recently introduced a new feature that enables users to contextualize their data using graphs. In this blog post, we will show you how to use graph semantics to create and explore graph data in Kusto, and how to visualize it using Plotly, a popular library for interactive data visualization in Python.

Graph semantics are a set of operators that allow users to work with graph data in Kusto, without the need to use a separate graph database or framework.

Click through for the KQL you’ll need, as well as how to display that in Plotly.

Drawing Horizontal Box Plots in R

Steven Sanderson is not limited to one axis:

Boxplots are a great way to visualize the distribution of a numerical variable. They show the median, quartiles, and outliers of the data, and can be used to compare the distributions of multiple groups.

Horizontal boxplots are a variant of the traditional boxplot, where the x-axis is horizontal and the y-axis is vertical. This can be useful for visualizing data where the x-axis variable is categorical, such as species or treatment group.

Click through for an example using base R and ggplot2.
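
To make the quoted description concrete, here is a minimal Python sketch (standard library only; Steven's post itself uses R) of the numbers a boxplot draws: the median, the quartiles, and the points flagged as outliers. The 1.5 × IQR fence is the common Tukey convention and is an assumption here, and the sample values are made up.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Return the quartiles and outliers that a boxplot would display."""
    # "inclusive" treats the data as the whole population of interest.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # assumed Tukey fences
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
stats = boxplot_stats(sample)
print(stats)
```

Whether the plot is drawn vertically or horizontally, these are the same five-ish numbers; only the axis they are laid out along changes.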

Reshaping Records using cdata

John Mount takes us through a common data wrangling problem:

In many data science projects we have the data, but it “is in the wrong format.” Fortunately re-formatting or reshaping data is a solved problem, with many different available tools.

For this note, I would like to show how to reshape data using the data algebra’s cdata data reshaping tool. This should give you familiarity with a tool to use on your own data.

Click through for an example in Python. Mount and Nina Zumel also have an R package for cdata.
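
For readers new to the idea, the sketch below shows what "reshaping" means in the simplest case: turning wide records (one column per measurement) into long records (one row per measurement). It is plain Python and deliberately does not use the cdata API; the function name `wide_to_long` and the column names are hypothetical.

```python
def wide_to_long(rows, id_key, value_keys, key_name="measure", value_name="value"):
    """Turn each wide row into one long row per measurement column."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            long_rows.append(
                {id_key: row[id_key], key_name: key, value_name: row[key]}
            )
    return long_rows

# Made-up wide records: one row per id, one column per measurement.
wide = [
    {"id": 1, "height": 1.7, "weight": 65.0},
    {"id": 2, "height": 1.8, "weight": 80.0},
]
long_form = wide_to_long(wide, "id", ["height", "weight"])
for record in long_form:
    print(record)
```

A dedicated tool earns its keep when the record shapes are more complicated than this one-key case.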

Safe Mode for Updates in MySQL

Chad Callihan is no dummy:

Did you know MySQL has a flag designed to prevent accidentally changing more data than you intended? If not, I think you’ll find it easy to remember as the flag has a memorable name: ‘i-am-a-dummy.’ If you have this flag set and leave off a WHERE clause when updating or deleting data, MySQL will prevent the statement from executing.

Let’s walk through an example using i-am-a-dummy and its “Safe Updates” Workbench counterpart.

Seems like this should be on by default for most servers.
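
As a rough illustration of the guard the quote describes, here is a toy Python check that flags an UPDATE or DELETE with no WHERE clause. It is not how MySQL implements the flag (the real check happens inside the server); the function name `blocked_by_safe_mode` and the `accounts` table are hypothetical.

```python
import re

def blocked_by_safe_mode(statement):
    """Toy check: True for an UPDATE or DELETE that has no WHERE clause."""
    text = statement.strip().rstrip(";")
    is_write = re.match(r"(update|delete)\b", text, flags=re.IGNORECASE) is not None
    # Naive on purpose: a WHERE inside a string literal would fool this.
    has_where = re.search(r"\bwhere\b", text, flags=re.IGNORECASE) is not None
    return is_write and not has_where

statements = [
    "UPDATE accounts SET balance = 0;",
    "UPDATE accounts SET balance = 0 WHERE id = 42;",
    "DELETE FROM accounts;",
    "SELECT * FROM accounts;",
]
for sql in statements:
    print(blocked_by_safe_mode(sql), sql)
```

The point is only the shape of the rule: writes that would touch every row get stopped before they run.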

SQL Server 2022 on SUSE Linux Enterprise Server Now Available in Azure Marketplace

Arvind Mahadevan has an announcement:

We are pleased to announce that we have worked with both SUSE and Microsoft engineering teams to release the latest SQL Server 2022 on SLES v15 SP5 Azure Marketplace image. This is in alignment with our goal to have the latest SQL Server on Linux Azure Marketplace images.

It’s a short post but does give us an idea of where they’re at on Linux support. Support for Ubuntu 22.04 is still in preview, so I’d expect that to come out soon as well.

String Regularization and Tokenization in SQL Server

Aaron Bertrand saves some space:

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)

It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Click through for one technique Aaron has to tighten things up a bit.

Distinct Counts in Power BI and KQL

Dany Hoter needs a distinct count:

Calculating distinct counts on massive distributed datasets is not trivial.

Kusto (aka Azure Data Explorer/KQL database in Fabric) dcount and dcountif functions use a special algorithm to return an estimate of distinct counts.

The new functions count_distinct and count_distinctif were recently added to calculate exact distinct counts. These two functions are much more expensive than the original ones.

Read on for more details on how this all works.
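
To see why an estimate can be so much cheaper than an exact answer, consider the Python sketch below. The exact count has to remember every distinct value, while the estimator keeps only a fixed number of hash values. The estimator here is K-minimum-values, swapped in because it is short to write; it is not Kusto's algorithm, and the function names, the `k` parameter, and the generated data are all assumptions for illustration.

```python
import hashlib
import heapq

def exact_distinct(items):
    """Exact answer: memory grows with the number of distinct values."""
    return len(set(items))

def estimated_distinct(items, k=1024):
    """K-minimum-values estimate: memory is bounded by k hash values."""
    kept = []        # heap of negated hashes, so -kept[0] is the largest kept hash
    members = set()  # the same kept hashes, for duplicate checks
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
        if h in members:
            continue
        if len(kept) < k:
            heapq.heappush(kept, -h)
            members.add(h)
        elif h < -kept[0]:
            # New hash is among the k smallest: evict the current largest.
            evicted = -heapq.heappushpop(kept, -h)
            members.discard(evicted)
            members.add(h)
    if len(kept) < k:
        return len(kept)  # fewer than k distinct values seen, so this is exact
    return int((k - 1) / -kept[0])

# Made-up data: 50,000 events from 10,000 distinct users.
events = [f"user-{i % 10000}" for i in range(50000)]
exact = exact_distinct(events)
estimate = estimated_distinct(events)
print(exact, estimate)
```

The trade is the same one the quote describes: bounded memory and a small error on one side, a precise number at a higher cost on the other.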

Apache Kafka Consumer Group Strategy

Lucia Cerchie gives us some advice:

Ever dealt with a misbehaving consumer group? Imbalanced broker load? This could be due to your consumer group and partitioning strategy! 

Once, on a dark and stormy night, I set myself up for this error. I was creating an application to demonstrate how you can use Apache Kafka® to decouple microservices. The function of my “microservices” was to create latte objects for a restaurant ordering service. It was set up a little like this:

I wanted to implement this in Kafka by using consumers, each reading from a common coffee topic, but with their own partition. Now this was a naive approach. Why? 

Click through to learn the reason, as well as some of the mechanics of how consumer groups work.
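
One mechanic worth having in mind before reading: within a consumer group, each partition is read by exactly one consumer. The toy Python sketch below hands out partitions round-robin to show what that implies. It is a simplification and not Kafka's actual assignor code; the function name and the partition and consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer in the group, in turn."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Hypothetical three-partition coffee topic.
partitions = ["coffee-0", "coffee-1", "coffee-2"]

two = assign_round_robin(partitions, ["consumer-a", "consumer-b"])
four = assign_round_robin(
    partitions, ["consumer-a", "consumer-b", "consumer-c", "consumer-d"]
)
print(two)
print(four)
```

With two consumers, one of them carries two partitions; with four consumers, the fourth has nothing to read. Either way, the partition count shapes how work is spread across the group.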

Plotting Decision Trees in R

Steven Sanderson builds a tree:

Decision trees are a powerful machine learning algorithm that can be used for both classification and regression tasks. They are easy to understand and interpret, and they can be used to build complex models without the need for feature engineering.

Once you have trained a decision tree model, you can use it to make predictions on new data. However, it can also be helpful to plot the decision tree to better understand how it works and to identify any potential problems.

In this blog post, we will show you how to plot decision trees in R using the rpart and rpart.plot packages. We will also provide an extensive example using the iris data set and explain the code blocks in simple to use terms.

Read on to see an example of how to do this.
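
If it helps to know what a plotted tree is showing, the Python sketch below picks a single split on one numeric feature by minimizing Gini impurity and prints the result as a two-leaf text tree. This is an illustration of the general idea only, not the rpart algorithm or Steven's code; the data points and labels are made up rather than taken from iris.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(rows):
    """Find the threshold on one numeric feature with the lowest weighted Gini."""
    best = None
    values = sorted({x for x, _ in rows})
    # Candidate thresholds sit halfway between neighbouring feature values.
    for lower, upper in zip(values, values[1:]):
        threshold = (lower + upper) / 2
        left = [label for x, label in rows if x < threshold]
        right = [label for x, label in rows if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (threshold, score, left, right)
    return best

# Made-up (feature value, class label) pairs.
rows = [(1.4, "a"), (1.5, "a"), (1.3, "a"), (4.7, "b"), (4.5, "b"), (5.1, "b")]
threshold, score, left, right = best_split(rows)

print(f"feature < {threshold}")
print(f"  yes -> {max(set(left), key=left.count)}")
print(f"  no  -> {max(set(right), key=right.count)}")
```

A real tree repeats this choice inside each branch, which is what produces the nested boxes in a plot.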

Data Center Staffing Disasters

Steve Jones reads an after-action report:

There was a failure recently at an Azure data center in Australia when a utility power sag caused equipment to trip offline at one of the Azure data centers in Australia. You can read about it here, but essentially the headline is that there were only three people on site when the incident occurred, and that caused them to be unable to restart the equipment in time before an outage occurred.

Read on to learn more about why this failed and what Steve has seen in the wild.
