Press "Enter" to skip to content

Category: Hadoop

Analyzing Web Server Logs With Spark

Fisseha Berhane uses web server log analysis to contrast three methods of using Spark:

This is the third tutorial in the Spark RDDs vs DataFrames vs SparkSQL blog post series. The first one is available here. In the first part, we saw how to retrieve, sort and filter data using Spark RDDs, DataFrames and SparkSQL. In the second part (here), we saw how to work with multiple tables in Spark the RDD way, the DataFrame way and with SparkSQL. In this third part of the blog post series, we will perform web server log analysis using real-world text-based production logs. Log data can be used for monitoring servers, improving business and customer intelligence, building recommendation systems, detecting fraud, and much more. Server log analysis is a good use case for Spark. It’s a very large, common data source and contains a rich set of information.

This tutorial shows you three different ways to solve several problems, including calculating file sizes, counts by response code, top endpoints, and more.
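To give a flavor of the approaches, here is a minimal PySpark sketch (not taken from the tutorial itself) that counts requests by response code and finds top endpoints with the DataFrame API and with SparkSQL; the log path and the Common Log Format regexes are assumptions for illustration.

```python
# Hedged sketch: parse raw web server logs and aggregate them two ways.
# The file path and regex patterns are assumptions, not from the tutorial.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("web-log-analysis").getOrCreate()

raw = spark.read.text("/data/access_log.txt")  # hypothetical path

# Pull a few fields out of Common Log Format lines.
logs = raw.select(
    regexp_extract("value", r'^(\S+)', 1).alias("host"),
    regexp_extract("value", r'"\S+\s+(\S+)\s+\S+"', 1).alias("endpoint"),
    regexp_extract("value", r'"\s+(\d{3})', 1).cast("int").alias("status"),
)

# DataFrame way: counts by response code, then top endpoints.
logs.groupBy("status").count().orderBy(col("count").desc()).show()
logs.groupBy("endpoint").count().orderBy(col("count").desc()).show(10)

# SparkSQL way: register a temp view and run SQL against it.
logs.createOrReplaceTempView("logs")
spark.sql("""
    SELECT status, COUNT(*) AS cnt
    FROM logs
    GROUP BY status
    ORDER BY cnt DESC
""").show()
```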


Voice-Driven Workflow With Apache MiNiFi

Tim Spann shows how to take off the shelf hardware and build a voice-activated workflow:

Using a Raspberry Pi 3 with Google AIY Voice Kit is an easy way to control data flows in an organization.

The AIY Voice Kit is a good way to prototype voice ingestion. It is not perfect, but for low-end, inexpensive hardware, it is a good solution for those who speak clearly and are willing to repeat a phrase a few times for activation.

I was able to easily add a word for it to react to. I also have it waiting for someone to press the button to activate it.

It seems like a fun DIY project.


Streaming Analytics With Kafka

Rathnadevi Manivannan shows how to use Kafka SQL to query streaming data:

Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time, enrich the trip data with other station details, and find the number of trips started and ended in a day for a particular station. It is also used to publish trip data from the source to other destinations for further analysis.

In this article, let’s discuss enriching the Citi Bike trip data and finding the number of trips on a particular day to and from a particular station.

Read on for a nice tutorial.
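As a rough illustration of the kind of statements the article builds, here is a hedged sketch that submits KSQL to a KSQL server’s REST API from Python: a trips stream, a stations table used for enrichment, and a daily tumbling-window count of trips per start station. The topic names, column names, and server address are assumptions, not details from the article.

```python
# Hedged sketch: send KSQL statements to a KSQL server's REST endpoint.
# Topics, columns, and the server URL are assumptions for illustration.
import requests

KSQL_URL = "http://localhost:8088/ksql"  # assumed KSQL server address

statements = """
    CREATE STREAM trips (trip_id BIGINT, start_station_id INT, end_station_id INT)
      WITH (KAFKA_TOPIC='citibike_trips', VALUE_FORMAT='JSON');

    CREATE TABLE stations (station_id INT, station_name VARCHAR)
      WITH (KAFKA_TOPIC='citibike_stations', VALUE_FORMAT='JSON', KEY='station_id');

    -- Enrich each trip with the start station's details.
    CREATE STREAM trips_enriched AS
      SELECT t.trip_id, s.station_name AS start_station_name
      FROM trips t
      LEFT JOIN stations s ON t.start_station_id = s.station_id;

    -- Count trips started per station per day.
    CREATE TABLE trips_started_per_day AS
      SELECT start_station_id, COUNT(*) AS trips_started
      FROM trips WINDOW TUMBLING (SIZE 1 DAYS)
      GROUP BY start_station_id;
"""

resp = requests.post(KSQL_URL, json={"ksql": statements, "streamsProperties": {}})
resp.raise_for_status()
print(resp.json())
```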


The Need For Multiple Warehouse Architectures

James Serra argues in favor of using both a data lake and a traditional data warehouse:

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog that is now over a few years old at Hadoop and Data Warehouses, so this is a good time to provide an update. I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, and my presentation Big data architectures and the data lake.

Read on for James’s argument, which is good.  My argument is summed up as follows:  the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements.  The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.


What’s New With KSQL

Hojjat Jafarpour announces Kafka’s KSQL version 0.3:

Additionally, we have taken the first steps to provide metrics and observability in KSQL. This greatly enhances the operability of KSQL, like in cases where you’re monitoring KSQL capacity or when diagnosing issues. You can now see different metrics for streams, tables, and queries for every KSQL server instance.

For streams and tables, we now have a DESCRIBE EXTENDED <stream/table name> statement to show statistics, such as the number of messages processed per second, total messages, and the time when the last message was received, as well as corresponding failure metrics.

Looks like they’re building it out a piece at a time.
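For a sense of what pulling those statistics might look like, here is a hedged sketch that issues DESCRIBE EXTENDED through a KSQL server’s REST API; the stream name and server address are assumptions.

```python
# Hedged sketch: request the extended description (including per-instance
# statistics) of a stream. Stream name and server URL are assumptions.
import requests

resp = requests.post(
    "http://localhost:8088/ksql",
    json={"ksql": "DESCRIBE EXTENDED pageviews;", "streamsProperties": {}},
)
resp.raise_for_status()
# The response includes metrics such as messages processed per second and
# the timestamp of the last message, as described in the announcement.
print(resp.json())
```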


Data Skip Techniques In Impala

Mostafa Mokhtar, et al., explain a few methods for skipping unneeded data in Impala queries:

Each Apache Parquet file contains a footer where metadata can be stored including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the min_value and max_value fields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated min and max fields.

The effectiveness of the Parquet min_value/max_value column statistics for data skipping can be increased by ordering (or clustering) data when it is written, which reduces the range of values that fall between the minimum and maximum value for any given file. It was for this reason that Impala 2.9 added the SORT BY clause to table DDL, which directs Impala to sort data locally during an INSERT before writing the data to files.

Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).
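Here is a hedged sketch of the pattern the post describes: declare SORT BY in the table DDL so that INSERTs write locally sorted Parquet files whose min_value/max_value footers let selective queries skip data. The table, column names, and impyla connection details are assumptions for illustration.

```python
# Hedged sketch: Impala SORT BY DDL so inserted Parquet files carry tight
# min/max column statistics. Names and connection details are assumptions.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# Sorting on event_ts narrows each file's [min, max] range for that column,
# so a selective predicate on event_ts can skip most files entirely.
cur.execute("""
    CREATE TABLE events_sorted (
        event_id BIGINT,
        event_ts TIMESTAMP,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    SORT BY (event_ts)
    STORED AS PARQUET
""")

# Rewriting existing data through the sorted table produces clustered files.
cur.execute("""
    INSERT INTO events_sorted PARTITION (event_date)
    SELECT event_id, event_ts, payload, event_date FROM events_raw
""")
```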


So You *Really* Want To Monitor Kafka…

Yeva Byzek walks through Confluent Platform:

Kafka exposes hundreds of metrics. Some of them are per broker, per client, per topic, and per partition, and so the number of metrics scales up as the cluster grows. For an average-size Kafka cluster, the number of metrics very quickly bloats to the thousands.

Warning: I am about to disappoint you. You probably recognize that you realistically cannot monitor every single available metric. So you are probably hoping that in this blog post I will filter down the list of metrics to a dozen of the most critical ones, which you would then push through some generic monitoring tool, and then be done with setting up “monitoring.” However, monitoring distributed systems like Kafka is not that simple, and so there is no such list. Keep reading to understand the problems you should be solving, and how to solve them in a robust monitoring solution specifically designed for Kafka.

A common pitfall of generic monitoring tools is that they import all available metrics from a variety of systems into a metrics swamp. Even with a comprehensive list of metrics, there is a limit to what can be achieved with no Kafka context nor Kafka expertise to determine which metrics are important and which ones are not. A metrics swamp cannot produce valuable insight from the data nor provide answers to the critical business questions we asked earlier.

This is an information-dense post that you’ll want to read if you work with Apache Kafka.
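As a small counterpoint to the metrics swamp, here is a hedged sketch that surfaces one question-answering metric (consumer lag) through the statistics callback of the confluent-kafka Python client; the topic, group id, and reporting interval are assumptions for illustration.

```python
# Hedged sketch: report only consumer lag per partition instead of dumping
# every available metric. Topic, group id, and interval are assumptions.
import json
from confluent_kafka import Consumer

def on_stats(stats_json):
    stats = json.loads(stats_json)
    # Answer one question ("are we keeping up?") rather than importing
    # every counter into a metrics swamp.
    for topic, tstats in stats.get("topics", {}).items():
        for pid, pstats in tstats.get("partitions", {}).items():
            lag = pstats.get("consumer_lag", -1)
            if lag > 0:
                print(f"lag topic={topic} partition={pid} lag={lag}")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "monitoring-demo",
    "statistics.interval.ms": 15000,  # emit client statistics every 15s
    "stats_cb": on_stats,
})
consumer.subscribe(["events"])

while True:
    msg = consumer.poll(1.0)  # poll() also services the stats callback
    if msg is None or msg.error():
        continue
    # ... process msg ...
```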


Hadoop 3.0 Ships

Alex Woodie reports that Hadoop 3.0 is officially out there, and looks at what’s forthcoming in 3.1 and 3.2:

As we told you about last week, Hadoop 3.0 brings two big new features that are compelling in their own right. That includes support for erasure coding, which should boost storage efficiency by 50% thanks to more efficient data replication; and YARN Federation, which should allow Hadoop clusters to scale up to 40,000 nodes.

The delivery of Hadoop 3.0 shows that the open source community is responding to the demands of industry, said Doug Cutting, original co-creator of Apache Hadoop and the chief architect at Cloudera.

“It’s tremendous to see this significant progress, from the raw tool of eleven years ago, to the mature software in today’s release,” he said in a press release. “With this milestone, Hadoop better meets the requirements of its growing role in enterprise data systems.”

But some of the new features in Hadoop 3.0 weren’t designed to bring immediate rewards to users. Instead, they pave the way for the Apache Hadoop community to deliver more compelling features with versions 3.1 and 3.2, according to Hortonworks director of engineering Vinod Kumar Vavilapalli, who’s also a committer on the Apache Hadoop project.

“Hadoop 3.0 is actually a building block, a foundation, for more exciting things to come in 3.1 and 3.2,” he said.

Click through to see some of those exciting things.


So You Want To Monitor Kafka…

Ben Summer reviews a Kafka catastrophe and explains how to avoid it:

Here at New Relic, the Edge team is responsible for the pipelines that handle all the data coming into our company. We were an early adopter of Apache Kafka, which we began using to power this data pipeline. Our initial results were outstanding. Our cluster handled any amount of data we threw at it; it showed incredible fault tolerance and scaled horizontally. Our implementation was so stable for so long that we basically forgot about it. Which is to say, we totally neglected it. And then one day we experienced a catastrophic incident.

Our main cluster seized up. All graphs, charts, and dashboards went blank. Suddenly we were totally in the dark — and so were our customers. The incident lasted almost four hours, and in the end, an unsatisfactory number of customers experienced some kind of data loss. It was an epic disaster. Our Kafka infrastructure had been running like a champ for more than a year and suddenly it had ground to a halt.

This happened several years ago, but to this day we still refer to the incident as the “Kafkapocalypse.”

Ben has a couple interesting stories and some good rules of thumb for maintaining a Kafka cluster.


Enabling Exactly-Once Kafka Streams

Guozhang Wang wraps up his exactly-once series in Kafka:

When restarting the application from the point of failure, we would then try to resume processing from the previously remembered position in the input Kafka topic, i.e. the committed offset. However, since the application was not able to commit the offset of the processed message before crashing last time, upon restarting it would fetch A again. The processing logic will then be triggered a second time to update the state, and generate the output messages. As a result, the application state will be updated twice (e.g. from S’ to S’’) and the output messages will be sent and appended to topic TB twice as well. If, for example, your application is calculating a running count from the input data stream stored in topic TA, then this “duplicated processing” error would mean over-counting in your application, resulting in incorrect results.

Today, many stream processing systems that claim to provide “exactly-once” semantics actually depend on users themselves to cooperate with the underlying source and destination streaming data storage layer like Kafka, because they simply treat this layer as a black box and hence do not try to handle these failure cases at all. Application user code then has to either coordinate with these data systems—for example, via a two-phase commit mechanism—to guarantee no data duplicates, or handle duplicated records that could be generated from the clients talking to these systems when the above-mentioned failure happens.

There’s some good information in here, so check it out.
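Kafka Streams exposes this with a single setting (processing.guarantee=exactly_once), but the machinery underneath is the transactional consume-process-produce loop. Below is a hedged Python sketch of that loop using the confluent-kafka client rather than Kafka Streams itself; the broker address, topic names (TA and TB, as in the excerpt), and the transactional id are assumptions.

```python
# Hedged sketch: transactional consume-process-produce, the mechanism behind
# exactly-once processing. Broker, topics, and ids are assumptions.
from confluent_kafka import Consumer, Producer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "counter-app",
    "enable.auto.commit": False,          # offsets commit inside the transaction
    "isolation.level": "read_committed",  # ignore aborted transactional writes
    "auto.offset.reset": "earliest",
})
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "counter-app-tx-1",  # stable id enables zombie fencing
})

consumer.subscribe(["TA"])
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    producer.begin_transaction()
    try:
        # "Process": here we just forward the value; state updates would go
        # to a changelog topic inside the same transaction.
        producer.produce("TB", key=msg.key(), value=msg.value())

        # Commit the input offset atomically with the output message, so a
        # crash can never replay A after its output already reached TB.
        offsets = [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)]
        producer.send_offsets_to_transaction(
            offsets, consumer.consumer_group_metadata()
        )
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
```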
