Category: Hadoop

Apache Samza At 1.0

Published 2018-11-29 by Kevin Feasel

Jagadish Venkatraman announces Apache Samza 1.0:

We are pleased to announce today the release of Samza 1.0, a significant milestone in the history of the project. Apache Samza is a distributed stream processing framework that we developed at LinkedIn in 2013. Samza became a top-level Apache project in 2014. Fast-forward to 2018, and we currently have over 3,000 applications in production leveraging Samza at LinkedIn. The use-cases include detecting anomalies, combating fraud, monitoring performance, notifications, real-time analytics, and many more. Today, Samza integrates not only with Apache Kafka, but also with many other systems, including Azure EventHubs, Amazon Kinesis, HDFS, ElasticSearch, and Brooklin. Multiple companies like Slack, TripAdvisor, eBay, and Optimizely have adopted Samza.

We view Samza 1.0 as a step towards our vision of making stream processing universally accessible. In this post, we describe our journey in building and scaling a distributed stream processing system. We also present the key features in Samza 1.0: a rich high-level API, event-time-based processing, integration with Apache Beam, Samza SQL, a standalone mode to run Samza without YARN, and a new test framework for Samza applications.

This runs in the same space as Spark Streaming, Flink, and Kafka Streams, so there are plenty of competitors and a lot of innovation.

Comments closed

Kafka And Handling Missing Topics

Published 2018-11-29 by Kevin Feasel

The folks at Redglue show what happens when you send a message to a Kafka broker on a non-existent topic:

Now let’s produce messages to a non-existent topic called redglue_nonexistent:

root@kafka1:~# kafka-console-producer --broker-list 127.0.0.1:9092 --topic redglue_nonexistent
I maybe don't exists
[2018-11-28 14:22:12,454] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 1 : {redglue_nonexistent=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)

Obviously there is a WARNING saying that the topic doesn’t exist, but it still allows you to “send” messages to that specific topic.
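Because the only signal is a WARN line in the client log, a script can watch for it. Here is a minimal Python sketch, assuming the log format matches the warning shown above. The function name `topics_without_leader` is made up for illustration and is not part of any Kafka tooling.

```python
import re

# Matches the "{topic=ERROR_CODE}" part of the NetworkClient warning shown above.
# Assumes one topic per warning line, as in the quoted output.
_METADATA_ERROR = re.compile(r"\{(?P<topic>[^=}\s]+)=(?P<error>[A-Z_]+)\}")


def topics_without_leader(log_lines):
    """Return the topics reported as LEADER_NOT_AVAILABLE, in first-seen order."""
    seen = []
    for line in log_lines:
        if "Error while fetching metadata" not in line:
            continue
        match = _METADATA_ERROR.search(line)
        if match and match.group("error") == "LEADER_NOT_AVAILABLE":
            topic = match.group("topic")
            if topic not in seen:
                seen.append(topic)
    return seen


sample = (
    "[2018-11-28 14:22:12,454] WARN [Producer clientId=console-producer] "
    "Error while fetching metadata with correlation id 1 : "
    "{redglue_nonexistent=LEADER_NOT_AVAILABLE} "
    "(org.apache.kafka.clients.NetworkClient)"
)
print(topics_without_leader([sample]))  # ['redglue_nonexistent']
```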

Read on to see what happens.


Spark MLflow 0.8.0 Released

Published 2018-11-27 by Kevin Feasel

Aaron Davidson and Jules Damji announce MLflow 0.8.0 on the Spark platform:

Improved MLflow UI Experience

Compact Display for Metrics and Parameters: To avoid clutter and an explosion of columns for each metric or parameter, now we group them together in a single tabular column by default. That way, each run’s parameters and metrics are listed nearby. Users can still click each parameter or metric to display it in a separate column or sort by it and customize their view this way.
Nesting Runs: For nested MLflow runs, which are common in hyperparameter search or multi-step workflows, the UI will display a collapsible tree underneath each parent run. This makes it much easier to organize and visualize multi-step workflows.
Labeling Runs: While MLflow gives each run a UUID by default, you can also now assign each run a name through the API. These names can also be edited in the UI.
UI Persistence: The MLflow UI now remembers your filters, sorting and column setup in browser local storage so you no longer need to reconfigure the view each time.

Looks like there are some nice additions here.


Disaster Recovery With Kafka Deployments

Published 2018-11-27 by Kevin Feasel

Yeva Byzek walks us through a disaster recovery scenario when running Apache Kafka:

Imagine:

Disaster strikes: catastrophic hardware failure, software failure, power outage, denial of service attack or some other event causes one datacenter with an Apache Kafka® cluster to completely fail. Yet Kafka continues running in another datacenter, and it already has a copy of the data from the original datacenter, replicated to and from the same topic names. Client applications switch from the failed cluster to the running cluster and automatically resume data consumption in the new datacenter based on where they left off in the original datacenter. The business has minimized downtime and data loss resulting from the disaster, and continues to run its mission critical applications.

Ultimately, enabling the business to continue running is what disaster recovery planning is all about, as datacenter downtime and data loss can result in businesses losing revenue or entirely halting operations. To minimize the downtime and data loss resulting from a disaster, enterprises should create business continuity plans and disaster recovery strategies.
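Resuming "where it left off" is the subtle part, because a message's offset in the surviving cluster need not match its offset in the failed one. One common approach is to resume by message timestamp instead. The Kafka consumer API exposes this lookup as offsetsForTimes, which returns the earliest offset whose timestamp is at or after a given time. What follows is a toy Python sketch of that lookup, not the linked article's method. An in-memory list stands in for a partition, the names are hypothetical, and timestamps are assumed to be preserved by replication and non-decreasing.

```python
from bisect import bisect_left


def offset_for_time(partition_timestamps, target_ms):
    """Return the earliest offset whose timestamp is >= target_ms, or None.

    partition_timestamps[i] is the timestamp (ms) of the message at offset i
    in the surviving cluster's copy of the partition; assumed to be sorted.
    """
    offset = bisect_left(partition_timestamps, target_ms)
    return offset if offset < len(partition_timestamps) else None


# Hypothetical replica partition: offsets 0..4 with these message timestamps.
replica = [1000, 1500, 1500, 2200, 3000]

# The consumer last processed a message stamped 1500 ms in the failed cluster,
# so it resumes from the first replica message at or after that time.
print(offset_for_time(replica, 1500))  # 1
print(offset_for_time(replica, 5000))  # None: nothing at or after that time
```

Resuming at a timestamp can replay a few messages, so consumers that fail over this way should tolerate duplicates.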

Distributed data sources can still succumb to disaster and many of the same policies that people learn when working with relational databases apply to things like Kafka as well.


Using Kafka To Drive ML Predictions

Published 2018-11-23 by Kevin Feasel

Kai Waehner shows us a model architecture for using Apache Kafka to generate predictions from trained models:

Kafka applications are event based, and leverage stream processing to continuously process input data. If you’re using Kafka, then you can embed an analytic model natively in a Kafka Streams or KSQL application. There are various examples of Kafka Streams microservices embedding models built with TensorFlow, H2O or Deeplearning4j natively.

It is not always possible or feasible to embed analytic models directly due to architectural, security or organizational reasons. You can also choose to use RPC to perform model inference from your Kafka application (bearing in mind the pros and cons discussed above). You can visit my project for an example of gRPC integration between a Kafka Streams microservice and locally hosted TensorFlow Serving container for making predictions with a hosted TensorFlow model.
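The difference between the two patterns comes down to where the prediction function lives. As a rough illustration in plain Python (not Kafka Streams, and not the linked project's code), the stream processor below takes the prediction function as a parameter. `embedded_model` is a hypothetical stand-in for a model loaded in-process, and a list stands in for a topic. A function that calls a model server over RPC could be passed in its place without changing the processing loop.

```python
from typing import Callable, Iterable, Iterator, Tuple

Features = Tuple[float, float]


def embedded_model(features: Features) -> float:
    """Hypothetical stand-in for a trained model loaded inside the application."""
    weight_a, weight_b, bias = 0.5, 0.25, 1.0
    return weight_a * features[0] + weight_b * features[1] + bias


def score_stream(
    events: Iterable[Features],
    predict: Callable[[Features], float],
) -> Iterator[Tuple[Features, float]]:
    """Apply `predict` to each incoming event and yield (event, prediction)."""
    for event in events:
        yield event, predict(event)


# A list stands in for the input topic.
incoming = [(2.0, 4.0), (0.0, 0.0)]
for event, prediction in score_stream(incoming, embedded_model):
    print(event, prediction)
```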

There are a couple of separate and interesting patterns here.


Kafka Analytics Patterns In HDP 3.1

Published 2018-11-23 by Kevin Feasel

George Vetticaden walks us through what’s coming with Apache Kafka in Hortonworks Data Platform 3.1:

A summary of these three new access patterns:

Stream Processing: Kafka Streams Support – With existing support for Spark Streaming and SAM/Storm, the addition of Kafka Streams provides developers with more options for their stream processing and microservice needs.
SQL Analytics: New Hive Kafka Storage Handler – View Kafka topics as tables and execute SQL via Hive with full SQL Support for joins, windowing, aggregations, etc.
OLAP Analytics: New Druid Kafka Indexing Service – View Kafka topics as cubes and perform OLAP style analytics on streaming events in Kafka using Druid.

Click through for high-level explanations of each. George promises more detailed explanations as well.


Deploying Cloudera Enterprise On Azure

Published 2018-11-19 by Kevin Feasel

Xavier Morera announces a new Cloudera course:

You will start by learning the Microsoft Azure services required to deploy a secure, elastic, Cloudera Enterprise cluster. These core services include security, networking, virtual machine management, and storage, just to name a few.

Then, you’ll learn best practices and patterns for cloud-based clusters, including tips and caveats for security and workload management.

Next, you’ll learn how to bootstrap a cluster using Cloudera Manager, which allows you to deploy a cluster on premises or in the cloud. The module covers how to deploy both development (Path A) and production-grade (Path B) clusters.

This is a free course, so if you’re looking for a way to fill your Thanksgiving weekend, this is definitely an option.


Working With The Databricks API Via PowerShell

Published 2018-11-16 by Kevin Feasel

Gerhard Brueckl has a PowerShell module for interacting with Databricks, either Azure or AWS:

As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up having 20+ single scripts which was not really maintainable any more. So I packed them into a PowerShell module and also published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!
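For a sense of what such cmdlets wrap, here is a small Python sketch of one underlying REST call: listing clusters through the Databricks Clusters API (`GET /api/2.0/clusters/list`) with a bearer token. It is not the module's code. The workspace URL and token are placeholders, and the helper name `build_cluster_list_request` is invented for this example. The request is only built, not sent.

```python
import urllib.request


def build_cluster_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Clusters API list endpoint."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder workspace URL and token; substitute real values before sending.
request = build_cluster_list_request(
    "https://example.cloud.databricks.com/", "dapi-EXAMPLE"
)
print(request.full_url)
print(request.get_method())
```

Passing the request to `urllib.request.urlopen` would send it. A real script would also handle HTTP errors and keep the token out of source control.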

This looks like a pretty good module if you work with Databricks.


Kafka Connect Converters And Serialization

Published 2018-11-15 by Kevin Feasel

Robin Moffatt goes into great detail on Apache Kafka Connect converters and serialization techniques:

Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include:

Connectors – the JAR files that define how to integrate with the data store itself

Converters – handling serialization and deserialization of data

Transforms – optional in-flight manipulation of messages

One of the more frequent sources of mistakes and misunderstanding around Kafka Connect involves the serialization of data, which Kafka Connect handles using converters. Let’s take a good look at how these work, and illustrate some of the common issues encountered.
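One concrete instance of this kind of mismatch is a connector configured with `value.converter=org.apache.kafka.connect.json.JsonConverter` and `value.converter.schemas.enable=true`. In that case each message is expected to be a JSON envelope with top-level `schema` and `payload` fields, and plain JSON will not deserialize. The Python sketch below checks a raw message value for that envelope. The helper name and sample messages are made up for illustration.

```python
import json


def has_schema_envelope(raw_value: bytes) -> bool:
    """True if the value is a JSON object with top-level "schema" and "payload" keys."""
    try:
        decoded = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not UTF-8 JSON at all (for example, Avro-encoded bytes).
        return False
    return isinstance(decoded, dict) and "schema" in decoded and "payload" in decoded


plain = b'{"id": 1}'
wrapped = (
    b'{"schema": {"type": "struct", "fields": '
    b'[{"field": "id", "type": "int32"}]}, "payload": {"id": 1}}'
)

print(has_schema_envelope(plain))    # False: suits schemas.enable=false
print(has_schema_envelope(wrapped))  # True: suits schemas.enable=true
```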

Read on for a good overview of the topic.


Tuning Apache Spark Applications

Published 2018-11-14 by Kevin Feasel

Vidisha Gupta has a few tips for tuning Apache Spark programs:

Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries –

Java Serialization: By default, Spark uses Java’s ObjectOutputStream framework, which can work with any class that implements java.io.Serializable. This serialization is flexible but slow, and it produces large serialized formats for many classes.
Kryo Serialization: Spark can use the Kryo library to serialize objects. It is much faster and more compact, but it does not support all Serializable types, so for best performance we should register the classes we want to serialize. Once classes are registered, Kryo uses indices instead of full class names to identify data types, which reduces the size of the serialized data and thereby increases performance. We can initialize our Spark conf by setting the value of the property spark.serializer to org.apache.spark.serializer.KryoSerializer. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To learn more about this serializer, refer to the Kryo documentation.
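As a small illustration of the Kryo settings, the Python sketch below assembles the `--conf` arguments one might pass to `spark-submit` for a JVM application. `spark.serializer` and `spark.kryo.classesToRegister` are standard Spark properties. The class names and the helper `kryo_submit_args` are hypothetical.

```python
def kryo_submit_args(classes_to_register):
    """Return spark-submit arguments that enable Kryo and register the given classes."""
    conf = {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}
    if classes_to_register:
        # Comma-separated list of fully qualified class names.
        conf["spark.kryo.classesToRegister"] = ",".join(classes_to_register)

    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical application classes that get shuffled or cached.
args = kryo_submit_args(["com.example.Click", "com.example.Session"])
print(" ".join(["spark-submit"] + args))
```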

There are some good tips in here.
