Month: May 2022

Monitoring Streaming Queries in PySpark

Hyukjin Kwon, et al., lay out some monitoring advice:

Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. However, monitoring streaming data workloads is challenging because the data is continuously processed as it arrives. Because of this always-on nature of stream processing, it is harder to troubleshoot problems during development and production without real-time metrics, alerting and dashboarding.

Read on to see how you can use the Observable API for alerting in PySpark. Previously, it had been a Scala-only API.

Comments closed

Projecting (Selecting) Results with KQL

Robert Cain continues a series on the Kusto Query Language:

So far in my Fun With KQL series, we have used the column tool, found on the right side of the output pane and described in my original post Fun With KQL – The Kusto Query Language, to arrange and reduce the number of columns in the output.

We can actually limit the number of columns, as well as set their order, right within our KQL query. To accomplish this we use the project operator.

Read on for several good uses of the project operator.

Distributed Transactions in T-SQL

Kevin Wilkie explains what distributed transactions are and why you probably don’t want to use them:

In the version of transactions that we are going to discuss today, we’re going to discuss doing transactions on multiple servers!

A distributed transaction is defined by Hazelcast to be “a set of operations on data that is performed across two or more data repositories”. In even simpler terms, it’s a command run against data on more than one server.

Click through for the warnings about what might possibly go wrong.

Fun with Nested Loops

Jared Poche explains my favorite type of join:

The nested loops join is the join operator you are likely to see most often. It tends to operate best on smaller data sets, especially when the first of the two tables being joined has a small data set.

In row mode, the first table returns rows one at a time to the join operator. The join operator then performs a seek or scan against the second table for each row passed in from the first table. It searches that table based on the data provided by the first table and the columns defined in our ON or WHERE clauses.

Read on for more information about nested loop joins.
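
As a rough illustration of the row-mode behavior described in the excerpt, here is a minimal Python sketch of a nested loops join. It is not how SQL Server implements the operator; the function name, table contents, and column names are all hypothetical, and the inner side is always scanned here, whereas a real engine would use an index seek when one is available.

```python
from typing import Callable, Iterator, Sequence

Row = dict


def nested_loops_join(
    outer: Sequence[Row],
    inner: Sequence[Row],
    predicate: Callable[[Row, Row], bool],
) -> Iterator[Row]:
    """Yield one combined row per (outer, inner) pair that satisfies the predicate."""
    # The outer (first) input hands over its rows one at a time.
    for outer_row in outer:
        # For each outer row, scan the inner (second) input from the start.
        for inner_row in inner:
            if predicate(outer_row, inner_row):
                yield {**outer_row, **inner_row}


# Hypothetical sample data: a small outer table and a slightly larger inner table.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [
    {"order_id": 10, "order_customer_id": 1},
    {"order_id": 11, "order_customer_id": 2},
    {"order_id": 12, "order_customer_id": 1},
]

joined = list(
    nested_loops_join(
        customers,
        orders,
        lambda c, o: c["customer_id"] == o["order_customer_id"],
    )
)

for row in joined:
    print(row["name"], row["order_id"])
```

The work grows with the number of outer rows times the cost of each inner lookup, which is one way to see why the operator favors a small first input.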

Finding Duplicates in Type 2 SCDs

Dinesh Asanka wants to verify some Type 2 slowly changing dimension results:

As we discussed in a previous article, Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses, there are three main types of slowly changing dimensions: Type 1, Type 2, and Type 3. Of these, Type 1 is the simplest: you maintain only the latest version of the attribute. For example, if an employee is promoted from Software Engineer to Senior Software Engineer, you overwrite the existing value with the new one, so the history is lost.

Type 2 Slowly Changing Dimensions are used to track historical data in a data warehouse. This is the most common approach for dimensions. This article uses AdventureWorksDW, the sample data warehouse database.

Click through for one way to compare, one which you could build using dynamic SQL.
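
To make the idea concrete, here is a small Python sketch of one possible check. It assumes that "duplicate" means a new Type 2 version whose tracked attributes are identical to the previous version for the same business key; the linked article may define and detect duplicates differently, and it works in T-SQL rather than Python. The column names and rows below are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def find_redundant_versions(rows, key, order_by, tracked):
    """Return versions whose tracked attributes match the previous version of the same key."""
    redundant = []
    # Sort so that each key's versions are adjacent and in chronological order.
    ordered = sorted(rows, key=itemgetter(key, order_by))
    for _, versions in groupby(ordered, key=itemgetter(key)):
        previous = None
        for version in versions:
            current = tuple(version[column] for column in tracked)
            # A version that changes nothing relative to its predecessor is suspect.
            if previous is not None and current == previous:
                redundant.append(version)
            previous = current
    return redundant


# Hypothetical employee dimension; ISO date strings sort chronologically.
dim_employee = [
    {"employee_key": "E1", "start_date": "2020-01-01", "title": "Software Engineer"},
    {"employee_key": "E1", "start_date": "2021-01-01", "title": "Senior Software Engineer"},
    {"employee_key": "E1", "start_date": "2022-01-01", "title": "Senior Software Engineer"},
    {"employee_key": "E2", "start_date": "2020-06-01", "title": "Analyst"},
]

redundant = find_redundant_versions(
    dim_employee, key="employee_key", order_by="start_date", tracked=["title"]
)

for row in redundant:
    print(row["employee_key"], row["start_date"])
```

Passing the tracked columns as a list mirrors the curator's point about dynamic SQL: the same comparison can be generated for any dimension once you know which attributes are tracked.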

When to Use a Map Visual

Mick Cisneros explains when to use map visuals:

That ubiquity has given all of us an increased familiarity with maps, as well as a deeper affinity for them. (Probably a dependence as well!) It’s natural, then, to want to use a map to visualize data that has a geographic dimension. Why not, right? There is an obvious upside: audiences are drawn to the way they look, as it’s a more memorable image than the same old bar chart or line graph. Not to mention: it’s fun to make maps!

The problem is that maps look interesting, but their very nature limits our options for visualizing data within them. Per a recent paper by Franconeri, Padilla, Shaw, et al., here are a couple of the comparisons that people are very good at making, perceptually:

Read on for a comparison of good map versus bad map. Just because something has a geographical component doesn’t mean you should map it.

Model Deployment Options in Azure

Tori Tompkins enumerates ways to deploy machine learning models in Azure:

There are so many options to deploy models in Azure that it can get quite overwhelming. In this blog, we break down all the available options and consider the pros and cons of each tooling option.

Even with those, there are other approaches as well, like hosting Spark-based models in Azure Synapse Analytics, using SQL Server Machine Learning Services on an Azure SQL Managed Instance or VM running SQL Server, etc.

Obscure Changes in SQL Server 2022

Aaron Bertrand has a three-parter on obscure changes in SQL Server 2022. First up we have some new information:

You can get the marketing blitz from just about anywhere, and the What’s New documentation for the bigger hitters from the technical side.

But what about the changes that aren’t on the highlight reel at Build and aren’t getting all the attention from the media blitz? I’m a details person, so I get a lot of insight looking around at the little, non-headline-generating things that have changed. I’ve shown before how to sneak a peek under the hood, and I’m going to do it again today:

Then we have feature selection changes:

You will notice some changes in the Feature Selection screen. Some of the options have been consolidated; for example, you now get R, Python, and Java with MLS, instead of picking. One item has been added: SQL Server Extension for Azure. Unfortunately, this option is checked by default in the CTP 2.0 version of setup (click to enlarge):

Finally, we have execution plan changes:

As a former technical product manager for Plan Explorer, I can’t help but snoop around in what has changed in the XSD for showplan. Even though I am not the best person to actually analyze what those changes mean, and even though changes in XSD don’t necessarily reflect changes the engine can produce right now – these usually lay the groundwork for engine changes that will happen later.

There’s quite a lot of information available for those willing to dig, as Aaron shows.

Bounding Box Queries in Azure Data Explorer

David Giard draws boxes:

For our current project, we are capturing into ADX the location of vehicles over time. Our customer asked us to create a function that would return all vehicles that are within a given bounding box in a given time period. This is useful information when they want to know when a vehicle returns to a building, a neighborhood, or a city.

In this article, I will show how this can be accomplished using built-in functions, the limitations of those functions, and ways to overcome those limitations.

Read on for the naive approach as well as a very interesting one using S2 cells.
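
For readers who want a feel for the basic filter before clicking through, here is a plain-Python sketch of a naive bounding box check over a time window. It is an analogy only: it does not use Azure Data Explorer or KQL, it does not cover the S2 cell technique, and it ignores boxes that cross the antimeridian. The `Position` type, field names, and coordinates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Position:
    vehicle_id: str
    timestamp: datetime
    latitude: float
    longitude: float


def vehicles_in_box(positions, min_lat, max_lat, min_lon, max_lon, start, end):
    """Return the sorted IDs of vehicles seen inside the box during [start, end]."""
    found = {
        p.vehicle_id
        for p in positions
        if min_lat <= p.latitude <= max_lat
        and min_lon <= p.longitude <= max_lon
        and start <= p.timestamp <= end
    }
    return sorted(found)


# Hypothetical readings: one match, one outside the box, one outside the time window.
positions = [
    Position("truck-1", datetime(2022, 5, 1, 8, 0), 41.88, -87.63),
    Position("truck-2", datetime(2022, 5, 1, 9, 0), 40.71, -74.00),
    Position("truck-3", datetime(2022, 5, 3, 9, 0), 41.90, -87.60),
]

result = vehicles_in_box(
    positions,
    min_lat=41.8,
    max_lat=42.0,
    min_lon=-87.7,
    max_lon=-87.5,
    start=datetime(2022, 5, 1),
    end=datetime(2022, 5, 2),
)

print(result)
```

This version compares every reading against the box, which is the kind of full scan that a spatial index such as S2 cells is meant to avoid.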
