Kafka exposes hundreds of metrics. Some of them are per broker, per client, per topic, and per partition, and so the number of metrics scales up as the cluster grows. For an average-size Kafka cluster, the number of metrics very quickly bloats to the thousands.
Warning: I am about to disappoint you. You probably recognize that you realistically cannot monitor every single available metric. So you are probably hoping that in this blog post I will filter down the list of metrics to a dozen of the most critical ones, which you would then push through some generic monitoring tool, and then be done with setting up “monitoring.” However, monitoring distributed systems like Kafka is not that simple, and so there is no such list. Keep reading to understand the problems you should be solving, and how to solve them in a robust monitoring solution specifically designed for Kafka.
A common pitfall of generic monitoring tools is that they import all available metrics from a variety of systems into a metrics swamp. Even with a comprehensive list of metrics, there is a limit to what can be achieved without the Kafka context or Kafka expertise needed to determine which metrics are important and which ones are not. A metrics swamp can neither produce valuable insight from the data nor provide answers to the critical business questions we asked earlier.
This is an information-dense post that you’ll want to read if you work with Apache Kafka.
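To make the scaling claim in the excerpt concrete, here is a minimal back-of-the-envelope sketch. The function name and every per-scope metric count in it are hypothetical placeholders chosen for illustration; they are not figures from the post or from Kafka's documentation.

```python
def estimate_metric_count(brokers, clients, topics, partitions_per_topic,
                          per_broker=100, per_client=50,
                          per_topic=10, per_partition=5):
    """Rough count of metric series in a cluster.

    The per_* defaults are made-up placeholders, not real Kafka numbers.
    """
    partitions = topics * partitions_per_topic
    return (brokers * per_broker
            + clients * per_client
            + topics * per_topic
            + partitions * per_partition)


# A small cluster versus a larger one (both hypothetical).
small = estimate_metric_count(brokers=3, clients=10, topics=20,
                              partitions_per_topic=6)
large = estimate_metric_count(brokers=12, clients=200, topics=300,
                              partitions_per_topic=12)
print(small, large)
```

Whatever the real per-scope counts are, the per-partition and per-client terms grow with the cluster, which is why the total climbs into the thousands so quickly.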
As we told you last week, Hadoop 3.0 brings two big new features that are compelling in their own right. That includes support for erasure coding, which should boost storage efficiency by 50% thanks to more efficient data replication; and YARN Federation, which should allow Hadoop clusters to scale up to 40,000 nodes.
The delivery of Hadoop 3.0 shows that the open source community is responding to the demands of industry, said Doug Cutting, original co-creator of Apache Hadoop and the chief architect at Cloudera.
“It’s tremendous to see this significant progress, from the raw tool of eleven years ago, to the mature software in today’s release,” he said in a press release. “With this milestone, Hadoop better meets the requirements of its growing role in enterprise data systems.”
But some of the new features in Hadoop 3.0 weren’t designed to bring immediate rewards to users. Instead, they pave the way for the Apache Hadoop community to deliver more compelling features with versions 3.1 and 3.2, according to Hortonworks director of engineering Vinod Kumar Vavilapalli, who’s also a committer on the Apache Hadoop project.
“Hadoop 3.0 is actually a building block, a foundation, for more exciting things to come in 3.1 and 3.2,” he said.
Click through to see some of those exciting things.
Here at New Relic, the Edge team is responsible for the pipelines that handle all the data coming into our company. We were an early adopter of Apache Kafka, which we began using to power this data pipeline. Our initial results were outstanding. Our cluster handled any amount of data we threw at it; it showed incredible fault tolerance and scaled horizontally. Our implementation was so stable for so long that we basically forgot about it. Which is to say, we totally neglected it. And then one day we experienced a catastrophic incident.
Our main cluster seized up. All graphs, charts, and dashboards went blank. Suddenly we were totally in the dark — and so were our customers. The incident lasted almost four hours, and in the end, an unsatisfactory number of customers experienced some kind of data loss. It was an epic disaster. Our Kafka infrastructure had been running like a champ for more than a year and suddenly it had ground to a halt.
This happened several years ago, but to this day we still refer to the incident as the “Kafkapocalypse.”
Ben has a couple of interesting stories and some good rules of thumb for maintaining a Kafka cluster.
The obvious difference between the JDBC benchmark and the PL/SQL one is the fact that the JDBC call has to traverse a vast amount of logic, APIs, “barriers” between the JVM and the Oracle kernel before it can actually invoke the really interesting part. This includes:
- JVM overhead
- JDBC logic
- Network overhead
- Various “outer” layers inside the Oracle database
- Oracle’s API layers to get into the SQL and PL/SQL execution engines
- The actual code running in the PL/SQL engine
In Toon’s talk (which again, you should definitely watch), the examples are running SQL code, not PL/SQL code, but the results are the same. The actual logic is relatively cheap inside of the database (as we’ve seen in the PL/SQL only benchmark), but the overhead is significant when calling database logic from outside the database.
Thus: It is very important to minimise that overhead.
This particular example focuses on Oracle and JDBC, but it certainly applies to other database platforms and distributed architectures.
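The point about per-call overhead can be sketched with a simple cost model. This is an illustration under stated assumptions, not a benchmark: the function name, the 200 µs per-call overhead, and the 1 µs per-row work are all hypothetical numbers picked to show the shape of the effect.

```python
def total_cost_us(rows, rows_per_call, overhead_us_per_call, work_us_per_row):
    """Total time in microseconds to process `rows` rows.

    Each call from outside the database pays a fixed overhead (JVM, driver,
    network, API layers); the actual logic costs `work_us_per_row` per row.
    All numbers passed in below are hypothetical.
    """
    calls = -(-rows // rows_per_call)  # ceiling division
    return calls * overhead_us_per_call + rows * work_us_per_row


# Same work, different number of round trips.
row_by_row = total_cost_us(rows=10_000, rows_per_call=1,
                           overhead_us_per_call=200, work_us_per_row=1)
batched = total_cost_us(rows=10_000, rows_per_call=1_000,
                        overhead_us_per_call=200, work_us_per_row=1)
print(row_by_row, batched)
```

In this model the logic itself costs the same in both cases; only the number of trips across the boundary changes, and that term dominates when the work per call is cheap.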
I will assume that you already have TFS set up and are just using HTTP and want to make things a bit more secure with HTTPS. I am also assuming that you will be using port 443 for HTTPS traffic.
To update TFS to use HTTPS you need to do a couple of things:
- Have a legitimate certificate installed on the server that you can bind to
- Have an IP address on the server and have firewall access set up to that IP address on port 443
But there are a few more steps as well, so click through to see them all.
The first thing I had to do was reproduce the issue. So on my Windows 10 test VM I installed the Windows Subsystem for Linux (steps to do so are here) and then installed the Ubuntu app.
Then, I fired up a bash shell using WSL and then I installed SQL Server on Linux for Ubuntu as documented here.
Now, I completed the installation of SQL Server on Linux using mssql-conf. When that program completes, it attempts to start SQL Server on Linux. BOOM! I’m able to reproduce the same error.
Looking at the error, I decided to see if I could run SQL Server on Linux from the shell as the user mssql. This would remove systemd and mssql-conf from the picture. Basically I wanted to see if I could get another, more descriptive, error to pop out.
Anthony digs out a very useful debugging tool in Linux, strace. Sadly, he’s not able to solve the problem at the moment, but at least gets us a step in the right direction.
I will show you today how you can use Management Studio or any stored procedure to query data stored in a CSV file located on S3 storage. I am using the CSV file format as an example here; columnar Parquet gives much better performance.
I am going to:
1. Put a simple CSV file on S3 storage
2. Create External table in Athena service over the data file bucket
3. Create linked server to Athena inside SQL Server
4. Use OPENQUERY to query the data.
The Athena service is built on top of Presto, a distributed SQL engine, and also uses Apache Hive to create, alter and drop tables. You can run ANSI SQL statements in the Athena query editor, launching it from the AWS web services UI. You can use complex joins, window functions and many other great SQL language features. Using Athena eliminates the need for ETL because it projects your schema on the data files at the time of the query.
Standard linked server warnings apply, but sometimes you need to bridge a couple technologies.
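For the last step in the list, note that OPENQUERY takes the remote query as a string literal rather than a variable, so the statement usually has to be assembled as text, with any single quotes in the remote query doubled. A minimal sketch of that assembly follows; the helper name, the linked server name `ATHENA`, and the table `sampledb.sales` are hypothetical and not taken from the post.

```python
def build_openquery(linked_server, remote_sql):
    """Return a T-SQL SELECT that runs remote_sql on a linked server.

    OPENQUERY expects the remote query as a string literal, so single
    quotes inside it are escaped by doubling them.
    """
    # Keep the illustrative server name to a plain identifier.
    if not linked_server.replace("_", "").isalnum():
        raise ValueError("linked server name should be a simple identifier")
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}');"


# Hypothetical linked server and Athena table names.
stmt = build_openquery(
    "ATHENA",
    "SELECT name, amount FROM sampledb.sales WHERE region = 'EU'",
)
print(stmt)
```

The resulting text can be pasted into Management Studio or executed as dynamic SQL from a stored procedure; the filter in the remote query is evaluated on the Athena side rather than after the rows reach SQL Server.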
It’s clear that before we inserted any data, we had 20 files that were in the PRECREATED state. After inserting 10 rows, we now have 18 PRECREATED files and 2 UNDER CONSTRUCTION files, which means the In-Memory engine is populating these files and that they are “open” in terms of their CHECKPOINT status. If you don’t understand what these terms mean, please read Part 1.
But there’s one thing that doesn’t look right here: we’ve inserted data into the table, but sizeBytesUsed is still zero for the UNDER CONSTRUCTION files. Why is that?
Ned does assume that you’ve read part 1 of the series already.
Thank you to all of you for taking time to contribute. I like the suggestion made by Glenda Gable on being partners to help with accountability on our goals. If anyone feels up to this, just leave a comment below; we can set up a Slack channel or something similar to help each other.
There were quite a few entries this month.