Press "Enter" to skip to content

When Not to Use Apache Kafka

Kai Waehner looks at when we may (or may not) want to use Apache Kafka:

Apache Kafka is the de facto standard for event streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How do you qualify Kafka out when it is not the right tool for the job? This blog post explores the DOs and DON'Ts. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.

I appreciate this kind of post a lot, especially from someone directly invested in the product. No technology can or should fit all purposes and the better you can explain where something does not fit, the better you can explain where it does fit.

Log Analytics and Power BI

Chris Webb has started a new series:

As a Power BI administrator you want to see what’s happening in your tenant right now: who’s running queries, which datasets are refreshing and so on. That way, if a user calls you to complain that their report is slow or their dataset hasn’t refreshed yet, you can start troubleshooting immediately. Power BI’s integration with Log Analytics (currently in preview with some limitations) is a great source of information for this kind of troubleshooting: it gives you the ability to send various useful Analysis Services engine events, events that give you detailed information about queries and refreshes among other things, to Log Analytics with a latency of only a few minutes. Once you’ve done that you can write KQL queries to understand what’s going on, but writing queries is time-consuming; what you want, of course, is a Power BI report.

Click through to see how to use Power BI to access KQL data in Log Analytics, which you’re using to monitor Power BI behavior.
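
If you want to explore those events outside of a report, here is a minimal sketch of querying the workspace from Python with the azure-monitor-query package. The workspace ID is a placeholder, and the PowerBIDatasetsWorkspace table name plus the column and event names are my assumptions about what the integration captures, so adjust to what your own workspace actually logs:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder: use your Log Analytics workspace's ID.
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"

# KQL against the table the Power BI / Log Analytics integration populates;
# the table, OperationName value, and columns are assumptions to verify.
QUERY = """
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1h)
| where OperationName == "QueryEnd"
| project TimeGenerated, ExecutingUser, DurationMs, EventText
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(hours=1))

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```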

Azure Data Factory Activity Queue Times

Meagan Longoria waits in line:

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

Click through to see how you can calculate queue times across activities, pipelines, and data factories.
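
If you would rather pull those numbers programmatically than read them off the monitoring blade, here is a rough sketch using the azure-mgmt-datafactory package to walk the activity runs for a single pipeline run. The resource names and run ID are placeholders, and the durationInQueue element in the activity output is an assumption that holds for some activity types but not all, so treat this as a starting point rather than Meagan's exact method:

```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholders: substitute your own identifiers.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "my-resource-group"
FACTORY_NAME = "my-data-factory"
PIPELINE_RUN_ID = "run-id-from-the-monitoring-blade"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)

runs = client.activity_runs.query_by_pipeline_run(
    RESOURCE_GROUP, FACTORY_NAME, PIPELINE_RUN_ID, filters
)

for run in runs.value:
    output = run.output or {}
    # Some activity types report time spent waiting on the integration
    # runtime under durationInQueue; the exact shape varies by activity.
    queue = output.get("durationInQueue", {})
    print(run.activity_name, run.activity_type, run.duration_in_ms, queue)
```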

Automatic Plan Correction in Query Store

Deepthi Goguri hits on the type of benefit Query Store can provide:

How wonderful would it be if SQL Server had a way of automatically tuning our queries based on our workloads? Amazing, right?

Thanks to Microsoft for introducing the automatic tuning feature in SQL Server 2017, which is also available in Azure SQL Database. Automatic tuning has two features: automatic plan correction and automatic index correction. (Source: Microsoft)

So, what is this automatic option, and how does it work?

Click through to learn more. My experience with it has been very positive. It’s not perfect, but it does work really well.
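
As a quick illustration, here is a sketch in Python (via pyodbc, with a placeholder connection string) of enabling automatic plan correction and then reviewing the recommendations the engine has produced. The ALTER DATABASE syntax and the sys.dm_db_tuning_recommendations DMV are documented for SQL Server 2017 and later:

```python
import pyodbc

# Placeholder connection string; point it at your own instance and database.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

# autocommit=True so ALTER DATABASE doesn't run inside an implicit transaction.
with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    cursor = conn.cursor()

    # Turn on automatic plan correction for the current database.
    cursor.execute(
        "ALTER DATABASE CURRENT SET AUTOMATIC_TUNING "
        "(FORCE_LAST_GOOD_PLAN = ON);"
    )

    # See which Query Store-based recommendations exist, whether or not
    # they have been applied automatically.
    cursor.execute(
        "SELECT name, reason, score, state, details "
        "FROM sys.dm_db_tuning_recommendations;"
    )
    for row in cursor.fetchall():
        print(row.name, row.reason, row.score)
```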

Macros in Tabular Editor 3

Matt Allington notes a key feature in Tabular Editor 3:

Today I am talking about Macros in Tabular Editor 3. This is a new name for an old feature. In Tabular Editor 2, this feature is called Advanced Scripting (a term I actually prefer, but oh well). I think one reason for the name change is there are now multiple types of scripting, including the new DAX scripting feature (I covered that as a key feature I love in the article linked above).

Click through to see how it works. Tabular Editor 3 is a paid product, though the free Tabular Editor 2 is still around if your employer won’t front the cash for 3.

Addressable Disk Space and File Counts in SQL MI General Purpose

Niko Neugebauer has been busy:

In the previous blog posts in the SQL MI How-To series, we have already touched on the aspect of SQL MI reserved and available disk space, but as with everything, there are so many things to add and expand on. In this post we shall focus on the General Purpose service tier and the remote disk storage that is used in this service tier. Besides the explicit limits of the addressable space, which are tied to the number of CPU vCores, there are important aspects of the remote storage that will limit the number of database files that can be located there.

If you are interested in other posts on how to discover different aspects of SQL MI, please visit http://aka.ms/sqlmi-howto, which serves as a placeholder for the series.

Click through to see how it all fits together with Managed Instances.
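
To get a feel for where an instance stands against those limits, here is a small sketch, again Python plus pyodbc with a placeholder connection string, that counts files and sums sizes per database from sys.master_files. It is a simplification of what Niko covers, since the General Purpose limits also depend on how each file maps onto remote storage, but it gives you the raw numbers to compare:

```python
import pyodbc

# Placeholder connection string for a SQL Managed Instance.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-managed-instance.database.windows.net;"
    "DATABASE=master;UID=my_admin;PWD=my_password;"
)

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    # One row per database: file count and total size
    # (sys.master_files reports size in 8 KB pages).
    cursor.execute(
        """
        SELECT DB_NAME(database_id) AS database_name,
               COUNT(*) AS file_count,
               SUM(CAST(size AS bigint)) * 8 / 1024 AS total_size_mb
        FROM sys.master_files
        GROUP BY database_id
        ORDER BY total_size_mb DESC;
        """
    )
    for row in cursor.fetchall():
        print(row.database_name, row.file_count, row.total_size_mb)
```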

Improving Apache Flink Scheduler Performance

Zhilong Hong, et al., share some interesting results out of Apache Flink 1.14. Part one lays out the scene:

To estimate the effect of our optimizations, we conducted several experiments to compare the performance of Flink 1.12 (before the optimization) with Flink 1.14 (after the optimization). The job in our experiments contains two vertices connected with an all-to-all edge. The parallelisms of these vertices are both 10K. To have the temporary deployment descriptors distributed via the blob server, we set the configuration blob.offload.minsize to 100 KiB (from the default value of 1 MiB). This configuration means that blobs larger than the set value will be distributed via the blob server, and the size of the deployment descriptors in our test job is about 270 KiB. The results of our experiments are illustrated below:

Part two explains their improvements:

In Flink 1.12, the ExecutionEdge class is used to store the information of connections between tasks. This means that for the all-to-all distribution pattern, there would be O(n²) ExecutionEdges, which would take up a lot of memory for large-scale jobs. For two JobVertices connected with an all-to-all edge and a parallelism of 10K, it would take more than 4 GiB memory to store 100M ExecutionEdges. Since there can be multiple all-to-all connections between vertices in production jobs, the amount of memory required would increase rapidly.

As we can see in Fig. 1, for two JobVertices connected with the all-to-all distribution pattern, all IntermediateResultPartitions produced by upstream ExecutionVertices are isomorphic, which means that the downstream ExecutionVertices they connect to are exactly the same. The downstream ExecutionVertices belonging to the same JobVertex are also isomorphic, as the upstream IntermediateResultPartitions they connect to are the same too. Since every JobEdge has exactly one distribution type, we can divide vertices and result partitions into groups according to the distribution type of the JobEdge.

Click through for a dive into the architecture.
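
The counting argument behind the fix is easy to see in miniature. Here is a toy Python sketch, emphatically not Flink's actual ExecutionEdge or group classes, contrasting one edge object per producer-consumer pair with a single shared consumer group:

```python
# Toy model of the O(n^2) vs. O(n) object counts; not Flink's real
# scheduler data structures.

def naive_edge_count(parallelism: int) -> int:
    # All-to-all: one explicit edge per (producer, consumer) pair.
    return parallelism * parallelism

class SharedConsumerGroup:
    """One object naming the downstream subtasks, shared by every producer."""
    def __init__(self, parallelism: int):
        self.consumer_ids = range(parallelism)

def grouped_object_count(parallelism: int) -> int:
    group = SharedConsumerGroup(parallelism)
    # One shared group plus one reference from each producer to it.
    refs = [group] * parallelism
    return 1 + len(refs)

if __name__ == "__main__":
    p = 10_000
    print(naive_edge_count(p))      # 100,000,000 explicit edges
    print(grouped_object_count(p))  # 10,001 objects
```

Because every upstream partition feeds an identical set of downstream subtasks, the shared group loses no information; that is exactly the isomorphism the quote describes.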

Multiple Code Panes in RStudio

Tomaz Kastrun has good news for us:

On the RStudio home page, make sure to download version 2021.09 Preview (as of the writing of this blog post, this is still in preview) and install it on your client machine (supported on Windows, macOS, and Linux).

Once installation is complete, head to global options (Tools -> Global Options) and select Pane Layout. You will have a new set of buttons available (Add Column; Remove Column). With Add Column, an additional pane will be added to the layout.

It’s not as convenient as the right-click -> “Split horizontally” or “Split vertically” that we get in tools like SSMS and VS Code, but I’m happy to see this change in RStudio.

Replication Error 20084 on SQL Server 2019

I ran into a weird issue:

I was helping out with a SQL Server upgrade recently, going from 2016 to 2019. We ran into a problem when trying to run replmerg.exe for a merge replication subscription. Specifically, we were getting error code 20084, which means that the replication process couldn’t connect to one of the instances. Interestingly, the process couldn’t connect to the local instance, and the failure was immediate, within a couple of milliseconds. There was nothing in the management logs on either the distributor server or the subscriber server which indicated a problem. We were able to connect both sides together just fine: from the subscriber, we could connect to the distributor, and from the distributor, we could connect to the subscriber.

Click through for what error code 20084 typically means, as well as what turned out to be the problem here.
