Press "Enter" to skip to content

Category: Hadoop

Message Buses on the Market

Alex Woodie walks us through some of the options available for message buses:

Apache Kafka

Apache Kafka is a distributed open source messaging bus that was written in Java and Scala. The software implements a publish and subscribe messaging system that’s capable of moving large amounts of event data from sources to sinks, in a high-throughput manner with minimal latency and strong consistency guarantees. The software relies on Apache Zookeeper for management of the underlying cluster.

Kafka is based on the concept of producers and consumers. Event data originating from producers is stored timestamped partitions that are housed within Kafka topics. Meanwhile, consumer processes can read the data stored in Kafka partitions. Kafka automatically replicates partitions across multiple brokers (or nodes in the cluster), which allows Kafka to scale its message streaming service in a fault-tolerant manner.

Click through for descriptions of several good options. And if you want a big list, queues.io has one for you.

Comments closed

Event-Driven Microservices

Saeed Barghi gives us an overview of what event-driven microservices are:

Modern Microservices are all about making systems event-driven: instead of making remote requests and waiting for the response (services and components calling each other and tell each other what to do), we can send notifications to related microservices when an event occurs.

These events are facts about the business. For example, an ATM or online transaction, a new log entry, or a customer registering for a new mobile plan. They are the data points collected by organizations that make their datasets. The good thing is, we can store these events in the very same infrastructure that we use to broadcast them: Apache Kafka. The better thing is we can even process them in the same infrastructure with Stream Processing applications. This means our applications and systems are linked via this central data pipeline, that is capable of real time data broadcast and processing and all data sources are shared via this data pipeline.

Read the whole thing.

Comments closed

From pandas to Spark with koalas

Achilleus tries out Koalas:

Python is widely used programming language when it comes to Data science workloads and Python has way too many different libraries to back this fact. Most of the data scientists are familiar with Python and pandas mostly. But the main issue with Pandas is it works great for small and medium datasets but not so good on big-data workloads. The challenge now becomes to convert the existing pandas code to pyspark code. This is just not straight forward and has a lot of performance hits if python UDFs are used without much care.

Koalas tries to address the first problem ie lessen the friction of learning different APIs to port their existing Pandas code to Pyspark. With Koalas, we can just directly replace the existing pandas code with Koalas. As far as the performance goes, there are no numbers yet as it is still in the initial phase of the project. But this definitely looks promising though.

Read on for some initial thoughts on the product, including a few gotchas.

Comments closed

Overriding Spark Dependencies

Landon Robinson shows how to override a Spark dependency located on the classpath:

This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:
– gson-2.2.4.jar: the method is private, and therefore too old for use here
– gson-2.6.1: the method is public, and works fine.
Somewhere between the two, the method’s status changed.

So, because I had some functionality that required the method be public and accessible, it was important I specify the right version in my dependency manager (SBT). “That’s easy,” I thought. “No problem.”

Spoilers: there was a problem.

Comments closed

Kafka and MirrorMaker

Renu Tewari describes what MirrorMaker does for Kafka today and what is coming with version 2:

Apache Kafka has become an essential component of enterprise data pipelines and is used for tracking clickstream event data, collecting logs, gathering metrics, and being the enterprise data bus in a microservices based architectures. Kafka is essentially a highly available and highly scalable distributed log of all the messages flowing in an enterprise data pipeline. Kafka supports internal replication to support data availability within a cluster. However, enterprises require that the data availability and durability guarantees span entire cluster and site failures.

The solution, thus far, in the Apache Kafka community was to use MirrorMaker, an external utility, that helped replicate the data between two Kafka clusters within or across data centers. MirrorMaker is essentially a Kafka high-level consumer and producer pair, efficiently moving data from the source cluster to the destination cluster and not offering much else. The initial use case that MirrorMaker was designed for was to move data from clusters to an aggregate cluster within a data center or to another data center to feed batch or streaming analytics pipelines. Enterprises have a much broader set of  use cases and requirements on replication guarantees.

Read on for the list of benefits and upcoming features.

Comments closed

Collecting Hadoop Metrics from Multiple Clusters

Dmitry Tolpeko shows how you can collate Hadoop metrics from several ElasticMapReduce clusters:

The first step is to dynamically get the list of clusters and their IPs. Hadoop clusters are often reprovisioned, added and terminated, so you cannot use the static list and addresses. In case of Amazon EMR, you can use the following Linux shell command to get the list of active clusters:

aws emr list-clusters --active

From its output you can get the cluster IDs and names. As a cluster ID and IP can change over time, its name is usually permanent (like DEV or Adhoc-Analytics cluster) so it can be useful for various aggregation reports.

Read on to see what you can do with this list of clusters.

Comments closed

Apache Avro 1.9.0 Released

Fokko Driesprong announces the release of Apache Avro 1.9.0:

Avro is a remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. If you’re unfamiliar with Avro, I would highly recommend the explanation of Dennis Vriend at Binx.io about an introduction into Avro.

Over 272 Jira tickets have been resolved, and 844 PRs are included since 1.8.2. I’d like to point out several major changes.

That’s a lot of tickets.

Comments closed

Temporal Tables with Flink

Marta Paes shows off a new feature in Apache Flink:

In the 1.7 release, Flink has introduced the concept of temporal tables into its streaming SQL and Table API: parameterized views on append-only tables — or, any table that only allows records to be inserted, never updated or deleted — that are interpreted as a changelog and keep data closely tied to time context, so that it can be interpreted as valid only within a specific period of time. Transforming a stream into a temporal table requires:

– Defining a primary key and a versioning field that can be used to keep track of the changes that happen over time;
– Exposing the stream as a temporal table function that maps each point in time to a static relation.

It looks pretty good.

Comments closed

Auto-Terminating Unused EMR Clusters

Praveen Krishamoorthy Ravikumar shows how you can use AWS Lambda to terminate ElasticMapReduce clusters which have been idle for a certain amount of time:

To avoid this overhead, you must track the idleness of the EMR cluster and terminate it if it is running idle for long hours. There is the Amazon EMR native IsIdle Amazon CloudWatch metric, which determines the idleness of the cluster by checking whether there’s a YARN job running. However, you should consider additional metrics, such as SSH users connected or Presto jobs running, to determine whether the cluster is idle. Also, when you execute any Spark jobs in Apache Zeppelin, the IsIdle metric remains active (1) for long hours, even after the job is finished executing. In such cases, the IsIdle metric is not ideal in deciding the inactivity of a cluster.

In this blog post, we propose a solution to cut down this overhead cost. We implemented a bash script to be installed in the master node of the EMR cluster, and the script is scheduled to run every 5 minutes. The script monitors the clusters and sends a CUSTOM metric EMR-INUSE (0=inactive; 1=active) to CloudWatch every 5 minutes. If CloudWatch receives 0 (inactive) for some predefined set of data points, it triggers an alarm, which in turn executes an AWS Lambda function that terminates the cluster.

We went a slightly different route for auto-termination, killing after a fixed number of hours.

Comments closed