Monitoring Spark And Kafka

Larry Murdock gives some hints on monitoring Kafka topics and their associated Spark jobs:

Besides alerting for the hardware health, monitoring answers questions about the health of the overall distributed data pipeline. The Site Reliability Engineering book identifies “The Four Golden Signals” as the minimum of what you need to be able to determine: latency, traffic, errors, and saturation.

Latency is the time it takes for work to happen. In the case of data pipelines, that work is a message that has gone through many systems. To time it, you need to have some kind of work unit identifier that is reflected in the metrics that happen on the many segments of the workflow. One way to do this is to have an ID on the message, and have components place that ID in their logs. Alternatively, the messaging system itself could manage that in metadata attached to the messages.

Traffic is the demand from external sources, or the size of what is available to be consumed. Measuring traffic requires metrics that either specifically mean a new arrival or a new volume of data to be processed, or rules about metrics that allow you to proxy the measure of traffic.

Errors are particularly tricky to monitor in data pipelines because these systems don’t typically error out on the first sign of trouble. Some errors in data are to be expected and are captured and corrected. However, there are other errors that may be tolerated by the pipeline, but need to be feed into the monitoring system as error events. This requires specific logic in an application’s error capture code to emit this information in a way that will be captured by the monitoring system.

Saturation is the workload consuming all the resources available for doing work. Saturation can be the memory, network, compute, or disk of any system in the data pipeline. The kinds of indicators that we discussed in the previous post on tuning are all about avoiding saturation.

Larry then applies these concepts and gives links to some useful tools.

Related Posts

Extracting Phone Numbers With Apache Tika

Unni Mana knows how to get your digits: Last time, I had difficulties detecting phone numbers from different types of documents. The challenge was that I had to use different parsers to parse and extract the phone numbers. For example, to extract phone numbers from a Word document, I had to use a library that […]

Read More

Sharepoint And MAXDOP=1

Daniel Glenn has a public service announcement on Sharepoint maximum degree of parallelism: For my first install test, I was using a service account that did have administrative rights on SQL as well. I looked at the setting in SQL and SharePoint 2016 did change the Maximum Degree Of Parallelism setting to 1. So, the […]

Read More